Abstract

Kernel methods represent one of the most powerful tools in machine learning to tackle problems expressed in terms of function values and derivatives due to their capability to represent and model complex relations. While these methods show good versatility, they are computationally intensive and have poor scalability to large data as they require operations on Gram matrices. In order to mitigate this serious computational limitation, recently randomized constructions have been proposed in the literature, which allow the application of fast linear algorithms. Random Fourier features (RFF) are among the most popular and widely applied constructions: they provide an easily computable, low-dimensional feature representation for shift-invariant kernels. Despite the popularity of RFFs, very little is understood theoretically about their approximation quality. In this paper, we provide a detailed finite-sample theoretical analysis about the approximation quality of RFFs by (i) establishing optimal (in terms of the RFF dimension, and growing set size) performance guarantees in uniform norm, and (ii) presenting guarantees in $L^r$ ($1 \le r < \infty$) norms. We also propose an RFF approximation to derivatives of a kernel with a theoretical study on its approximation quality.

1 Introduction 

Kernel methods [17] have enjoyed tremendous success in solving several fundamental problems of machine learning ranging from classification, regression, feature extraction, dependency estimation, causal discovery, Bayesian inference and hypothesis testing. This success owes to their capability to represent and model complex relations by mapping points into high (possibly infinite) dimensional feature spaces. At the heart of all these techniques is the kernel trick, which allows one to implicitly compute inner products between these high dimensional feature maps $\Phi$ via a kernel function $k$: $k(x, y) = \langle \Phi(x), \Phi(y) \rangle$. However, this flexibility and richness of kernels has a price: by resorting to implicit computations these methods operate on the Gram matrix of the data, which raises serious computational challenges while dealing with large-scale data. In order to resolve this bottleneck, numerous solutions have been proposed, such as low-rank matrix approximations [25, 6, 1], explicit feature maps designed for additive kernels [23, 11], hashing [19, 9], and random Fourier features (RFF) [13] constructed for shift-invariant kernels, the focus of the current paper.

RFFs implement an extremely simple, yet efficient idea: instead of relying on the implicit feature map $\Phi$ associated with the kernel, by appealing to Bochner's theorem [24] (any bounded, continuous, shift-invariant kernel is the Fourier transform of a probability measure), [13] proposed an explicit low-dimensional random Fourier feature map $\phi$ obtained by empirically approximating the Fourier integral so that $k(x, y) \approx \langle \phi(x), \phi(y) \rangle$. The advantage of this explicit low-dimensional feature representation is that the kernel machine can be efficiently solved in the primal form through fast linear solvers, thereby enabling the handling of large-scale data. Through numerical experiments, it has also been demonstrated that kernel algorithms constructed using the approximate kernel do not suffer from significant performance degradation [13]. Another advantage of the RFF approach is that, unlike the low-rank matrix approximation approach [25, 6], which also speeds up kernel machines, it approximates the entire kernel function and not just the kernel matrix. This property is particularly useful while dealing with out-of-sample data and also in online learning applications. The RFF technique has found wide applicability in several areas such as fast function-to-function regression [12], differential privacy preservation [2] and causal discovery [10].

* Contributed equally.

Despite the success of the RFF method, surprisingly, very little is known about its performance guarantees. To the best of our knowledge, the only work in the machine learning literature providing certain theoretical insight into the accuracy of kernel approximation via RFF is [13, 22]:¹ it shows that $A_m := \sup\{|k(x, y) - \langle \phi(x), \phi(y) \rangle_{\mathbb{R}^{2m}}| : x, y \in \mathcal{S}\} = O_p(\sqrt{\log(m)/m})$ for any compact set $\mathcal{S} \subset \mathbb{R}^d$, where $m$ is the number of random Fourier features. However, since the approximation proposed by the RFF method involves empirically approximating the Fourier integral, the RFF estimator can be thought of as an empirical characteristic function (ECF). In the probability literature, the systematic study of ECF-s was initiated by [7] and followed up by [5, 4, 27]. While [7] shows the almost sure (a.s.) convergence of $A_m$ to zero, [5, Theorems 1 and 2] and [27, Theorems 6.2 and 6.3] show that the optimal rate is $m^{-1/2}$. In addition, [7] shows that almost sure convergence cannot be attained over the entire space (i.e., $\mathbb{R}^d$) if the characteristic function decays to zero at infinity. Due to this, [5, 27] study the convergence behavior of $A_m$ when the diameter of $\mathcal{S}$ grows with $m$ and show that almost sure convergence of $A_m$ is guaranteed as long as the diameter of $\mathcal{S}$ is $e^{o(m)}$. Unfortunately, all these results (to the best of our knowledge) are asymptotic in nature and the only known finite-sample guarantee by [13, 22] is non-optimal. In this paper (see Section 3), we present a finite-sample probabilistic bound for $A_m$ that holds for any $m$ and provides the optimal rate of $m^{-1/2}$ for any compact set $\mathcal{S}$ along with guaranteeing the almost sure convergence of $A_m$ as long as the diameter of $\mathcal{S}$ is $e^{o(m)}$. Since convergence in uniform norm might sometimes be too strong a requirement and may not be suitable to attain correct rates in the generalization bounds associated with learning algorithms involving RFF,² we also study the behavior of $k(x, y) - \langle \phi(x), \phi(y) \rangle_{\mathbb{R}^{2m}}$ in $L^r$-norm ($1 \le r < \infty$) and obtain an optimal rate of $m^{-1/2}$. The RFF approach to approximate a translation-invariant kernel can be seen as a special case of the problem of approximating a function in the barycenter of a family (say $\mathcal{F}$) of functions, which was considered in [14]. However, the approximation guarantees in [14, Theorem 3.2] do not directly apply to RFF as the assumptions on $\mathcal{F}$ are not satisfied by the cosine function, which is the family of functions that is used to approximate the kernel in the RFF approach. While a careful modification of the proof of [14, Theorem 3.2] could yield the $m^{-1/2}$ rate of approximation for any compact set $\mathcal{S}$, this result would still be sub-optimal by providing a linear dependence on $|\mathcal{S}|$ similar to the theorems in [13, 22], in contrast to the optimal logarithmic dependence on $|\mathcal{S}|$ that is guaranteed by our results.

Traditionally, kernel based algorithms involve computing the value of the kernel. Recently, kernel algorithms involving the derivatives of the kernel (i.e., the Gram matrix consists of derivatives of the kernel computed at training samples) have been used to address numerous machine learning tasks, e.g., semi-supervised or Hermite learning with gradient information [28, 18], nonlinear variable selection [15, 16], (multi-task) gradient learning [26] and fitting of distributions in an infinite-dimensional exponential family [20]. Given the importance of these derivative based kernel algorithms, similar to [13], in Section 4 we propose a finite dimensional random feature map approximation to kernel derivatives, which can be used to speed up the above mentioned derivative based kernel algorithms. We present a finite-sample bound that quantifies the quality of approximation in uniform and $L^r$-norms and show the rate of convergence to be $m^{-1/2}$ in both these cases.

A summary of our contributions is as follows. We

1. provide the first detailed finite-sample performance analysis of RFFs for approximating kernels and their derivatives;

2. prove uniform and $L^r$ convergence on fixed compact sets with optimal rate in terms of the RFF dimension ($m$);

3. give sufficient conditions for the growth rate of compact sets while preserving a.s. convergence uniformly and in $L^r$; specializing our result we match the best attainable asymptotic growth rate.


¹ [22] derived tighter constants compared to [13] and also considered different RFF implementations.

² For example, in applications like kernel ridge regression based on RFF, it is more appropriate to consider the approximation guarantee in $L^2$ norm than in the uniform norm.

Various notations and definitions that are used throughout the paper are provided in Section 2 along with a brief review of the RFF approximation proposed by [13]. The missing proofs of the results in Sections 3 and 4 are provided in the supplementary material.

2 Notations & preliminaries 

In this section, we introduce notations that are used throughout the paper and then present preliminaries on kernel approximation through random feature maps as introduced by [13].

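Definitions & Notation: For a topological space $\mathcal{X}$, $C(\mathcal{X})$ (resp. $C_b(\mathcal{X})$) denotes the space of all continuous (resp. bounded continuous) functions on $\mathcal{X}$. For $f \in C_b(\mathcal{X})$, $\|f\|_{\mathcal{X}} := \sup_{x \in \mathcal{X}} |f(x)|$ is the supremum norm of $f$. $M_b(\mathcal{X})$ and $M_+^1(\mathcal{X})$ are the sets of all finite Borel and probability measures on $\mathcal{X}$, respectively. For $\mu \in M_b(\mathcal{X})$, $L^r(\mathcal{X}, \mu)$ denotes the Banach space of $r$-power ($r \ge 1$) $\mu$-integrable functions. For $\mathcal{X} \subseteq \mathbb{R}^d$, we will use $L^r(\mathcal{X})$ for $L^r(\mathcal{X}, \mu)$ if $\mu$ is a Lebesgue measure on $\mathcal{X}$. For $f \in L^r(\mathcal{X}, \mu)$, $\|f\|_{L^r(\mathcal{X}, \mu)} := \left(\int_{\mathcal{X}} |f|^r \, d\mu\right)^{1/r}$ denotes the $L^r$-norm of $f$ for $1 \le r < \infty$ and we write it as $\|\cdot\|_{L^r(\mathcal{X})}$ if $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mu$ is the Lebesgue measure. For any $f \in L^1(\mathcal{X}, P)$ where $P \in M_+^1(\mathcal{X})$, we define $Pf := \int_{\mathcal{X}} f(x) \, dP(x)$ and $P_m f := \frac{1}{m} \sum_{i=1}^m f(X_i)$ where $(X_i)_{i=1}^m \overset{\text{i.i.d.}}{\sim} P$, $P_m := \frac{1}{m} \sum_{i=1}^m \delta_{X_i}$ is the empirical measure and $\delta_x$ is a Dirac measure supported on $x \in \mathcal{X}$. $\mathrm{supp}(P)$ denotes the support of $P$. $P^m := \otimes_{j=1}^m P$ denotes the $m$-fold product measure.

For $v := (v_1, \ldots, v_d) \in \mathbb{R}^d$, $\|v\|_2 := \sqrt{\sum_{i=1}^d v_i^2}$. The diameter of $A \subseteq \mathcal{Y}$ where $(\mathcal{Y}, \rho)$ is a metric space is defined as $|A|_\rho := \sup\{\rho(x, y) : x, y \in A\}$. If $\mathcal{Y} = \mathbb{R}^d$ with $\rho = \|\cdot\|_2$, we denote the diameter of $A$ as $|A|$; $|A| < \infty$ if $A$ is compact. The volume of $A \subseteq \mathbb{R}^d$ is defined as $\mathrm{vol}(A) = \int_A 1 \, dx$. For $A \subseteq \mathbb{R}^d$, we define $A_\Delta := A - A = \{x - y : x, y \in A\}$. $\mathrm{conv}(A)$ is the convex hull of $A$. For a function $g$ defined on an open set $B \subseteq \mathbb{R}^d \times \mathbb{R}^d$,

$$\partial^{p,q} g(x, y) := \frac{\partial^{|p| + |q|} g(x, y)}{\partial x_1^{p_1} \cdots \partial x_d^{p_d} \, \partial y_1^{q_1} \cdots \partial y_d^{q_d}}, \quad (x, y) \in B,$$

where $p, q \in \mathbb{N}^d$ are multi-indices, $|p| = \sum_{j=1}^d p_j$ and $\mathbb{N} := \{0, 1, 2, \ldots\}$. Define $v^p = \prod_{j=1}^d v_j^{p_j}$. For positive sequences $(a_n)_{n \in \mathbb{N}}$, $(b_n)_{n \in \mathbb{N}}$, $a_n = o(b_n)$ if $\lim_{n \to \infty} a_n / b_n = 0$. $X_n = O_p(r_n)$ (resp. $O_{a.s.}(r_n)$) denotes that $X_n / r_n$ is bounded in probability (resp. almost surely). $\Gamma(t) = \int_0^\infty x^{t-1} e^{-x} \, dx$ is the Gamma function, $\Gamma(1/2) = \sqrt{\pi}$ and $\Gamma(t + 1) = t \Gamma(t)$.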

Random feature maps: Let $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be a bounded, continuous, positive definite, translation-invariant kernel, i.e., there exists a positive definite function $\psi : \mathbb{R}^d \to \mathbb{R}$ such that $k(x, y) = \psi(x - y)$, $x, y \in \mathbb{R}^d$ where $\psi \in C_b(\mathbb{R}^d)$. By Bochner's theorem [24, Theorem 6.6], $\psi$ can be represented as the Fourier transform of a finite non-negative Borel measure $\Lambda$ on $\mathbb{R}^d$, i.e.,

$$k(x, y) = \psi(x - y) = \int_{\mathbb{R}^d} e^{\sqrt{-1}\, \omega^T (x - y)} \, d\Lambda(\omega) \overset{(*)}{=} \int_{\mathbb{R}^d} \cos\left(\omega^T (x - y)\right) d\Lambda(\omega), \tag{1}$$

where $(*)$ follows from the fact that $\psi$ is real-valued and symmetric. Since $\Lambda(\mathbb{R}^d) = \psi(0)$, $k(x, y) = \psi(0) \int_{\mathbb{R}^d} e^{\sqrt{-1}\, \omega^T (x - y)} \, dP(\omega)$ where $P := \Lambda / \psi(0) \in M_+^1(\mathbb{R}^d)$. Therefore, w.l.o.g., we assume throughout the paper that $\psi(0) = 1$ and so $\Lambda \in M_+^1(\mathbb{R}^d)$. Based on (1), [13] proposed an approximation to $k$ by replacing $\Lambda$ with its empirical measure $\Lambda_m$, constructed from $(\omega_i)_{i=1}^m \overset{\text{i.i.d.}}{\sim} \Lambda$, so that the resultant approximation can be written as the Euclidean inner product of finite dimensional random feature maps, i.e.,

$$\hat{k}(x, y) = \frac{1}{m} \sum_{i=1}^m \cos\left(\omega_i^T (x - y)\right) \overset{(*)}{=} \langle \phi(x), \phi(y) \rangle_{\mathbb{R}^{2m}}, \tag{2}$$

where $\phi(x) = \frac{1}{\sqrt{m}} \left(\cos(\omega_1^T x), \ldots, \cos(\omega_m^T x), \sin(\omega_1^T x), \ldots, \sin(\omega_m^T x)\right)$ and $(*)$ holds based on the basic trigonometric identity: $\cos(a - b) = \cos a \cos b + \sin a \sin b$. This elegant approximation to $k$ is particularly useful in speeding up kernel-based algorithms as the finite-dimensional random feature map $\phi$ can be used to solve these algorithms in the primal, thereby offering better computational complexity (than by solving them in the dual) while at the same time not lacking in performance. Apart from these practical advantages, [13, Claim 1] (and similarly, [22, Prop. 1]) provides a theoretical guarantee that $\|\hat{k} - k\|_{\mathcal{S} \times \mathcal{S}} \to 0$ as $m \to \infty$ for any compact set $\mathcal{S} \subset \mathbb{R}^d$. Formally, [13, Claim 1] showed that (note that (3) is slightly different but more precise than the one in the statement of Claim 1 in [13]) for any $\epsilon > 0$,

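$$\Lambda^m\left(\left\{(\omega_i)_{i=1}^m : \|\hat{k} - k\|_{\mathcal{S} \times \mathcal{S}} \ge \epsilon\right\}\right) \le C_d \left(|\mathcal{S}| \sigma \epsilon^{-1}\right)^{\frac{2d}{d+2}} e^{-\frac{m \epsilon^2}{4(d+2)}}, \tag{3}$$

where $\sigma^2 := \int \|\omega\|_2^2 \, d\Lambda(\omega)$ and $C_d := 2^{\frac{6d+2}{d+2}} \left(\left(\frac{2}{d}\right)^{\frac{d}{d+2}} + \left(\frac{d}{2}\right)^{\frac{2}{d+2}}\right) \le 2^7 d^{\frac{2}{d+2}}$ when $d \ge 2$. The condition $\sigma^2 < \infty$ implies that $\psi$ (and therefore $k$) is twice differentiable. From (3) it is clear that the probability has polynomial tails if $\epsilon < |\mathcal{S}| \sigma$ (i.e., small $\epsilon$) and Gaussian tails if $\epsilon \ge |\mathcal{S}| \sigma$ (i.e., large $\epsilon$) and can be equivalently written as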

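$$\Lambda^m\left(\left\{(\omega_i)_{i=1}^m : \|\hat{k} - k\|_{\mathcal{S} \times \mathcal{S}} \ge |\mathcal{S}| \sigma \sqrt{m^{-1} \log m}\right\}\right) \le m^{\frac{\alpha}{4(d+2)}} (\log m)^{-\frac{d}{d+2}}, \tag{4}$$

where $\alpha := 4d - C_d^{\frac{d+2}{d}} |\mathcal{S}|^2 \sigma^2$. For $|\mathcal{S}|$ sufficiently large (i.e., $\alpha < 0$), it follows from (4) that

$$\|\hat{k} - k\|_{\mathcal{S} \times \mathcal{S}} = O_p\left(|\mathcal{S}| \sqrt{m^{-1} \log m}\right). \tag{5}$$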

While (5) shows that $\hat{k}$ is a consistent estimator of $k$ in the topology of compact convergence (i.e., $\hat{k}$ converges to $k$ uniformly over compact sets), the rate of convergence of $\sqrt{(\log m)/m}$ is not optimal. In addition, the order of dependence on $|\mathcal{S}|$ is not optimal. A faster rate (in fact, an optimal rate) of convergence is desired, since better rates in (5) can lead to better convergence rates for the excess error of the kernel machine constructed using $\hat{k}$. The order of dependence on $|\mathcal{S}|$ is also important as it determines the number of RFF features (i.e., $m$) that are needed to achieve a given approximation accuracy. In fact, the order of dependence on $|\mathcal{S}|$ controls the rate at which $|\mathcal{S}|$ can be grown as a function of $m$ when $m \to \infty$ (see Remark 1(ii) for a detailed discussion about the significance of growing $|\mathcal{S}|$). In the following section, we present an analogue of (4) (see Theorem 1) that provides optimal rates and has correct dependence on $|\mathcal{S}|$.
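As an illustration of the construction in (2), the following is a minimal sketch (not the authors' code) for the Gaussian kernel $k(x, y) = e^{-\|x - y\|_2^2 / (2 s^2)}$, whose spectral measure $\Lambda$ is the normal distribution $N(0, s^{-2} I)$. The function names, the bandwidth `bandwidth` ($s$), the sample sizes and the test points are all assumptions made for this example.

```python
import numpy as np

def rff_features(X, omegas):
    """Map rows of X (n, d) to R^{2m} using frequencies omegas (m, d), as in (2)."""
    proj = X @ omegas.T                      # entry (i, j) is omega_j^T x_i
    m = omegas.shape[0]
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(m)

def gaussian_kernel(X, Y, bandwidth=1.0):
    """Exact Gaussian kernel matrix, used only as a reference."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

rng = np.random.default_rng(0)
d, m, n = 3, 4000, 50                        # illustrative sizes (assumed)
bandwidth = 1.0
X = rng.uniform(-1.0, 1.0, size=(n, d))      # points in a compact set
# Assumption: Gaussian kernel, so omega_i ~ N(0, I / bandwidth^2).
omegas = rng.normal(scale=1.0 / bandwidth, size=(m, d))

Phi = rff_features(X, omegas)
K_hat = Phi @ Phi.T                          # approximate kernel matrix
K = gaussian_kernel(X, X, bandwidth)
max_err = np.abs(K_hat - K).max()            # sup-error over the sampled pairs
print(f"max |k_hat - k| over {n}x{n} pairs: {max_err:.4f}")
```

The maximum is taken only over the sampled pairs, so it is a lower bound on the uniform error $\|\hat{k} - k\|_{\mathcal{S} \times \mathcal{S}}$ discussed in this section, not the quantity itself.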


3 Main results: approximation of k 


As discussed in Sections 1 and 2, while the random feature map approximation of $k$ introduced by [13] has many practical advantages, it does not seem to be theoretically well-understood. The existing theoretical results on the quality of approximation do not provide a complete picture owing to their non-optimality. In this section, we first present our main result (see Theorem 1) that improves upon (4) and provides a rate of $m^{-1/2}$ with logarithmic dependence on $|\mathcal{S}|$. We then discuss the consequences of Theorem 1 along with its optimality in Remark 1. Next, in Corollary 2 and Theorem 3, we discuss the $L^r$-convergence ($1 \le r < \infty$) of $\hat{k}$ to $k$ over compact subsets of $\mathbb{R}^d$.

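Theorem 1. Suppose $k(x, y) = \psi(x - y)$, $x, y \in \mathbb{R}^d$ where $\psi \in C_b(\mathbb{R}^d)$ is positive definite and $\sigma^2 := \int \|\omega\|_2^2 \, d\Lambda(\omega) < \infty$. Then for any $\tau > 0$ and non-empty compact set $\mathcal{S} \subset \mathbb{R}^d$,

$$\Lambda^m\left(\left\{(\omega_i)_{i=1}^m : \|\hat{k} - k\|_{\mathcal{S} \times \mathcal{S}} \ge \frac{h(d, |\mathcal{S}|, \sigma) + \sqrt{2\tau}}{\sqrt{m}}\right\}\right) \le e^{-\tau},$$

where $h(d, |\mathcal{S}|, \sigma) := 32\sqrt{2d \log(2|\mathcal{S}| + 1)} + 32\sqrt{2d \log(\sigma + 1)} + 16\sqrt{2d \left[\log(2|\mathcal{S}| + 1)\right]^{-1}}$.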

Proof (sketch). Note that $\|\hat{k} - k\|_{\mathcal{S} \times \mathcal{S}} = \sup_{x, y \in \mathcal{S}} |\hat{k}(x, y) - k(x, y)| = \sup_{g \in \mathcal{G}} |\Lambda_m g - \Lambda g|$, where $\mathcal{G} := \{g_{x,y}(\omega) = \cos(\omega^T (x - y)) : x, y \in \mathcal{S}\}$, which means the object of interest is the supremum of an empirical process indexed by $\mathcal{G}$. Instead of bounding $\sup_{g \in \mathcal{G}} |\Lambda_m g - \Lambda g|$ by using Hoeffding's inequality on a cover of $\mathcal{G}$ and then applying a union bound as carried out in [13, 22], we use the refined technique of applying concentration via McDiarmid's inequality, followed by symmetrization, and bound the Rademacher average by the Dudley entropy bound. The result is obtained by carefully bounding the $L^2(\Lambda_m)$-covering number of $\mathcal{G}$. The details are provided in Section B.1 of the supplementary material. □

Remark 1. (i) Theorem 1 shows that $\hat{k}$ is a consistent estimator of $k$ in the topology of compact convergence as $m \to \infty$ with the rate of a.s. convergence being $\sqrt{m^{-1} \log |\mathcal{S}|}$ (almost sure convergence is guaranteed by the first Borel-Cantelli lemma). In comparison to (4), it is clear that Theorem 1 provides improved rates with better constants and logarithmic dependence on $|\mathcal{S}|$ instead of a linear dependence. The logarithmic dependence on $|\mathcal{S}|$ ensures that we need $m = O(\epsilon^{-2} \log |\mathcal{S}|)$ random features instead of $O(\epsilon^{-2} |\mathcal{S}|^2 \log(|\mathcal{S}| / \epsilon))$ random features, i.e., significantly fewer features to achieve the same approximation accuracy of $\epsilon$.

(ii) Growing diameter: While Theorem 1 provides almost sure convergence uniformly over compact sets, one might wonder whether it is possible to achieve uniform convergence over $\mathbb{R}^d$. [7, Section 2] showed that such a result is possible if $\Lambda$ is a discrete measure but not possible for $\Lambda$ that is absolutely continuous w.r.t. the Lebesgue measure (i.e., if $\Lambda$ has a density). Since uniform convergence of $\hat{k}$ to $k$ over $\mathbb{R}^d$ is not possible for many interesting $k$ (e.g., the Gaussian kernel), it is of interest to study the convergence on $\mathcal{S}$ whose diameter grows with $m$. Therefore, as mentioned in Section 2, the order of dependence of rates on $|\mathcal{S}|$ is critical. Suppose $|\mathcal{S}_m| \to \infty$ as $m \to \infty$ (we write $|\mathcal{S}_m|$ instead of $|\mathcal{S}|$ to show the explicit dependence on $m$). Then Theorem 1 shows that $\hat{k}$ is a consistent estimator of $k$ in the topology of compact convergence if $m^{-1} \log |\mathcal{S}_m| \to 0$ as $m \to \infty$ (i.e., $|\mathcal{S}_m| = e^{o(m)}$), in contrast to the result in (4) which requires $|\mathcal{S}_m| = o(\sqrt{m / \log m})$. In other words, Theorem 1 ensures consistency even when $|\mathcal{S}_m|$ grows exponentially in $m$ whereas (4) ensures consistency only if $|\mathcal{S}_m|$ does not grow faster than $\sqrt{m / \log m}$.

(iii) Optimality: Note that $\psi$ is the characteristic function of $\Lambda \in M_+^1(\mathbb{R}^d)$ since $\psi$ is the Fourier transform of $\Lambda$ (by Bochner's theorem). Therefore, the object of interest $\|\hat{k} - k\|_{\mathcal{S} \times \mathcal{S}} = \|\hat{\psi} - \psi\|_{\mathcal{S}_\Delta}$ is the uniform norm of the difference between $\psi$ and the empirical characteristic function $\hat{\psi} = \frac{1}{m} \sum_{i=1}^m \cos(\langle \omega_i, \cdot \rangle)$, when both are restricted to a compact set $\mathcal{S}_\Delta \subset \mathbb{R}^d$. The question of the convergence behavior of $\|\hat{\psi} - \psi\|_{\mathcal{S}_\Delta}$ is not new and has been studied in great detail in the probability and statistics literature (e.g., see [7, 27] for $d = 1$ and [4, 5] for $d > 1$) where the characteristic function is not just a real-valued symmetric function (like $\psi$) but is Hermitian. [27, Theorems 6.2 and 6.3] show that the optimal rate of convergence of $\|\hat{\psi} - \psi\|_{\mathcal{S}_\Delta}$ is $m^{-1/2}$ when $d = 1$, which matches our result in Theorem 1. Also Theorems 1 and 2 in [5] show that the logarithmic dependence on $|\mathcal{S}_m|$ is optimal asymptotically. In particular, [5, Theorem 1] matches the growing diameter result in Remark 1(ii), while [5, Theorem 2] shows that if $\Lambda$ is absolutely continuous w.r.t. the Lebesgue measure and if $\limsup_{m \to \infty} m^{-1} \log |\mathcal{S}_m| > 0$, then there exists a positive $\epsilon$ such that $\limsup_{m \to \infty} \Lambda^m\left(\|\hat{\psi} - \psi\|_{(\mathcal{S}_m)_\Delta} \ge \epsilon\right) > 0$. This means the rate $|\mathcal{S}_m| = e^{o(m)}$ is not only the best possible in general for almost sure convergence, but if a faster sequence $|\mathcal{S}_m|$ is considered then even stochastic convergence cannot be retained for any characteristic function vanishing at infinity along at least one path. While these previous results match that of Theorem 1 (and its consequences), we would like to highlight the fact that all these previous results are asymptotic in nature whereas Theorem 1 provides a finite-sample probabilistic inequality that holds for any $m$. We are not aware of any such finite-sample result except for the one in [13, 22]. ■
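To make the finite-sample statement of Theorem 1 concrete, the sketch below evaluates the deviation threshold $(h(d, |\mathcal{S}|, \sigma) + \sqrt{2\tau}) / \sqrt{m}$ and inverts it to get a number of features that suffices for a target accuracy $\epsilon$ with probability at least $1 - e^{-\tau}$. The function names and the example inputs are assumptions for illustration; for the Gaussian kernel with bandwidth $s$ one has $\sigma^2 = d / s^2$. The constants in $h$ are conservative, so the resulting $m$ is a sufficient value rather than a practical recommendation.

```python
import math

def h_const(d, diam, sigma):
    """h(d, |S|, sigma) from Theorem 1; requires diam > 0 and sigma >= 0."""
    log_term = math.log(2.0 * diam + 1.0)
    return (32.0 * math.sqrt(2.0 * d * log_term)
            + 32.0 * math.sqrt(2.0 * d * math.log(sigma + 1.0))
            + 16.0 * math.sqrt(2.0 * d / log_term))

def uniform_error_bound(d, diam, sigma, m, tau):
    """Threshold exceeded with probability at most exp(-tau) (Theorem 1)."""
    return (h_const(d, diam, sigma) + math.sqrt(2.0 * tau)) / math.sqrt(m)

def features_needed(d, diam, sigma, eps, tau):
    """Smallest integer m for which the Theorem 1 threshold is at most eps."""
    c = h_const(d, diam, sigma) + math.sqrt(2.0 * tau)
    return math.ceil((c / eps) ** 2)

# Example inputs (assumed): Gaussian kernel, unit bandwidth, d = 3, |S| = 2.
d, diam, tau, eps = 3, 2.0, 3.0, 0.1
sigma = math.sqrt(d)
m_needed = features_needed(d, diam, sigma, eps, tau)
print(m_needed, uniform_error_bound(d, diam, sigma, m_needed, tau))
```

Doubling the diameter changes the required $m$ only through $\log(2|\mathcal{S}| + 1)$, which is the logarithmic dependence highlighted in Remark 1(i).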


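Using Theorem 1, one can obtain a probabilistic inequality for the $L^r$-norm of $\hat{k} - k$ over any compact set $\mathcal{S} \subset \mathbb{R}^d$, as given by the following result.

Corollary 2. Suppose $k$ satisfies the assumptions in Theorem 1. Then for any $1 \le r < \infty$, $\tau > 0$ and non-empty compact set $\mathcal{S} \subset \mathbb{R}^d$,

$$\Lambda^m\left(\left\{(\omega_i)_{i=1}^m : \|\hat{k} - k\|_{L^r(\mathcal{S})} \ge \left(\frac{\pi^{d/2} |\mathcal{S}|^d}{2^d \, \Gamma\left(\frac{d}{2} + 1\right)}\right)^{2/r} \frac{h(d, |\mathcal{S}|, \sigma) + \sqrt{2\tau}}{\sqrt{m}}\right\}\right) \le e^{-\tau},$$

where $\|\hat{k} - k\|_{L^r(\mathcal{S})} := \|\hat{k} - k\|_{L^r(\mathcal{S} \times \mathcal{S})} = \left(\int_{\mathcal{S}} \int_{\mathcal{S}} |\hat{k}(x, y) - k(x, y)|^r \, dx \, dy\right)^{1/r}$.

Proof. Note that

$$\|\hat{k} - k\|_{L^r(\mathcal{S})} \le \|\hat{k} - k\|_{\mathcal{S} \times \mathcal{S}} \, \mathrm{vol}^{2/r}(\mathcal{S}).$$

The result follows by combining Theorem 1 and the fact that $\mathrm{vol}(\mathcal{S}) \le \mathrm{vol}(A)$ where $A := \left\{x \in \mathbb{R}^d : \|x\|_2 \le \frac{|\mathcal{S}|}{2}\right\}$ and $\mathrm{vol}(A) = \frac{\pi^{d/2} |\mathcal{S}|^d}{2^d \, \Gamma\left(\frac{d}{2} + 1\right)}$ (which follows from [8, Corollary 2.55]). □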

Corollary 2 shows that $\|\hat{k} - k\|_{L^r(\mathcal{S})} = O_{a.s.}\left(m^{-1/2} |\mathcal{S}|^{2d/r} \sqrt{\log |\mathcal{S}|}\right)$ and therefore if $|\mathcal{S}_m| \to \infty$ as $m \to \infty$, then consistency of $\hat{k}$ in $L^r(\mathcal{S}_m)$-norm is achieved as long as $m^{-1/2} |\mathcal{S}_m|^{2d/r} \sqrt{\log |\mathcal{S}_m|} \to 0$ as $m \to \infty$. This means, in comparison to the uniform norm in Theorem 1 where $|\mathcal{S}_m|$ can grow exponentially in $m^\delta$ ($\delta < 1$), $|\mathcal{S}_m|$ cannot grow faster than $m^{\frac{r}{4d}} (\log m)^{-\frac{r}{4d} - \theta}$ ($\theta > 0$) to achieve consistency in $L^r$-norm.
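The volume factor in Corollary 2 and the resulting growth condition can be checked numerically. The sketch below computes the volume of the Euclidean ball of radius $|\mathcal{S}|/2$ and evaluates the rate $m^{-1/2} |\mathcal{S}_m|^{2d/r} \sqrt{\log |\mathcal{S}_m|}$ for a polynomially growing diameter $|\mathcal{S}_m| = m^a$. The function names and the choice $a = r/(8d)$ are assumptions made only for this example.

```python
import math

def ball_volume(d, diam):
    """Volume of the Euclidean ball of radius diam / 2 in R^d."""
    return math.pi ** (d / 2) * diam ** d / (2 ** d * math.gamma(d / 2 + 1))

def lr_rate(m, diam, d, r):
    """m^{-1/2} |S|^{2d/r} sqrt(log |S|), the rate read off Corollary 2 (needs diam > 1)."""
    return m ** -0.5 * diam ** (2 * d / r) * math.sqrt(math.log(diam))

d, r = 2, 2.0
a = r / (8 * d)                      # assumed growth exponent, below r / (4d)
for m in (10 ** 4, 10 ** 6, 10 ** 8):
    diam_m = m ** a                  # |S_m| = m^a
    print(m, round(diam_m, 3), lr_rate(m, diam_m, d, r))
```

With this choice the rate behaves like $m^{-1/4} \sqrt{a \log m}$ and tends to zero, whereas an exponentially growing diameter, which is admissible in uniform norm, would make it diverge.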

Instead of using Theorem 1 to obtain a bound on $\|\hat{k} - k\|_{L^r(\mathcal{S})}$ (this bound may be weak as $\|\hat{k} - k\|_{L^r(\mathcal{S})} \le \|\hat{k} - k\|_{\mathcal{S} \times \mathcal{S}} \, \mathrm{vol}^{2/r}(\mathcal{S})$ for any $1 \le r < \infty$), a better bound (for $2 \le r < \infty$) can be obtained by directly bounding $\|\hat{k} - k\|_{L^r(\mathcal{S})}$, as shown in the following result.

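Theorem 3. Suppose $k(x, y) = \psi(x - y)$, $x, y \in \mathbb{R}^d$ where $\psi \in C_b(\mathbb{R}^d)$ is positive definite. Then for any $1 < r < \infty$, $\tau > 0$ and non-empty compact set $\mathcal{S} \subset \mathbb{R}^d$,

$$\Lambda^m\left(\left\{(\omega_i)_{i=1}^m : \|\hat{k} - k\|_{L^r(\mathcal{S})} \ge \left(\frac{\pi^{d/2} |\mathcal{S}|^d}{2^d \, \Gamma\left(\frac{d}{2} + 1\right)}\right)^{2/r} \left(\frac{C'_r}{m^{1 - \max\{\frac{1}{2}, \frac{1}{r}\}}} + \frac{\sqrt{2\tau}}{\sqrt{m}}\right)\right\}\right) \le e^{-\tau},$$

where $C'_r$ is the Khintchine constant given by $C'_r = 1$ for $r \in (1, 2]$ and $C'_r = \sqrt{2} \left[\Gamma\left(\frac{r+1}{2}\right) / \sqrt{\pi}\right]^{1/r}$ for $r \in [2, \infty)$.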


Proof (sketch). As in Theorem 1, we show that $\|k - \hat{k}\|_{L^r(\mathcal{S})}$ satisfies the bounded difference property, hence by McDiarmid's inequality, it concentrates around its expectation $\mathbb{E}\|k - \hat{k}\|_{L^r(\mathcal{S})}$. By symmetrization, we then show that $\mathbb{E}\|k - \hat{k}\|_{L^r(\mathcal{S})}$ is upper bounded in terms of $\mathbb{E}_\varepsilon \left\|\sum_{i=1}^m \varepsilon_i \cos(\langle \omega_i, \cdot - \cdot \rangle)\right\|_{L^r(\mathcal{S})}$, where $\varepsilon := (\varepsilon_i)_{i=1}^m$ are Rademacher random variables. By exploiting the fact that $L^r(\mathcal{S})$ is a Banach space of type $\min\{r, 2\}$, the result follows. The details are provided in Section B.2 of the supplementary material. □

Remark 2. Theorem 3 shows an improved dependence on $|\mathcal{S}|$ without the extra $\sqrt{\log |\mathcal{S}|}$ factor given in Corollary 2 and therefore provides a better rate for $2 < r < \infty$ when the diameter of $\mathcal{S}$ grows, i.e., $\|\hat{k} - k\|_{L^r(\mathcal{S}_m)} \xrightarrow{a.s.} 0$ if $|\mathcal{S}_m| = o(m^{\frac{r}{4d}})$ as $m \to \infty$. However, for $1 < r < 2$, Theorem 3 provides a slower rate than Corollary 2 and therefore it is appropriate to use the bound in Corollary 2. While one might wonder why we only considered the convergence of $\|\hat{k} - k\|_{L^r(\mathcal{S})}$ and not $\|\hat{k} - k\|_{L^r(\mathbb{R}^d)}$, it is important to note that the latter is not well-defined because $\hat{k} \notin L^r(\mathbb{R}^d)$ even if $k \in L^r(\mathbb{R}^d)$. ■
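The constants in Theorem 3 are explicit and easy to evaluate. The following sketch computes the Khintchine constant $C'_r$ and the deviation threshold of Theorem 3; the function names and example inputs are assumptions for illustration, and the threshold follows the statement of Theorem 3 as given above.

```python
import math

def khintchine_constant(r):
    """C'_r from Theorem 3: 1 on (1, 2], sqrt(2) * (Gamma((r+1)/2) / sqrt(pi))^(1/r) on [2, inf)."""
    if not 1.0 < r < math.inf:
        raise ValueError("r must lie in (1, infinity)")
    if r <= 2.0:
        return 1.0
    return math.sqrt(2.0) * (math.gamma((r + 1.0) / 2.0) / math.sqrt(math.pi)) ** (1.0 / r)

def ball_volume(d, diam):
    """Volume of the Euclidean ball of radius diam / 2 in R^d."""
    return math.pi ** (d / 2) * diam ** d / (2 ** d * math.gamma(d / 2 + 1))

def lr_error_bound(d, diam, m, tau, r):
    """Threshold of Theorem 3, exceeded with probability at most exp(-tau)."""
    exponent = 1.0 - max(0.5, 1.0 / r)
    deviation = khintchine_constant(r) / m ** exponent + math.sqrt(2.0 * tau / m)
    return ball_volume(d, diam) ** (2.0 / r) * deviation

# Example inputs (assumed): d = 1, |S| = 2, m = 100, tau = 0.5, r = 2.
print(lr_error_bound(1, 2.0, 100, 0.5, 2.0))
```

For $r \ge 2$ the exponent $1 - \max\{1/2, 1/r\}$ equals $1/2$, giving the $m^{-1/2}$ rate; for $1 < r < 2$ it drops to $1 - 1/r$, which is the slower rate mentioned in Remark 2. The two branches of $C'_r$ agree at $r = 2$.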


4 Approximation of kernel derivatives 

In the previous section we focused on the approximation of the kernel function, where we presented uniform and $L^r$ convergence guarantees on compact sets for the random Fourier feature approximation, and discussed how fast the diameter of these sets can grow to preserve uniform and $L^r$ convergence almost surely. In this section, we propose an approximation to derivatives of the kernel and analyze the uniform and $L^r$ convergence behavior of the proposed approximation. As motivated in Section 1, the question of approximating the derivatives of the kernel through a finite dimensional random feature map is also important as it enables speeding up several interesting machine learning tasks that involve the derivatives of the kernel [28, 18, 15, 16, 26, 20]; see for example the recent infinite dimensional exponential family fitting technique [21], which implements this idea.

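To this end, we consider $k$ as in (1) and define $h_a := \cos\left(\frac{\pi a}{2} + \cdot\right)$, $a \in \mathbb{N}$ (in other words $h_0 = \cos$, $h_1 = -\sin$, $h_2 = -\cos$, $h_3 = \sin$ and $h_a = h_{a \bmod 4}$). For $p, q \in \mathbb{N}^d$, assuming $\int |\omega^{p+q}| \, d\Lambda(\omega) < \infty$, it follows from the dominated convergence theorem that

$$\partial^{p,q} k(x, y) = \int_{\mathbb{R}^d} \omega^p (-\omega)^q h_{|p+q|}\left(\omega^T (x - y)\right) d\Lambda(\omega)$$

$$= \int_{\mathbb{R}^d} \omega^{p+q} \left[h_{|p|}(\omega^T x) \, h_{|q|}(\omega^T y) + h_{3+|p|}(\omega^T x) \, h_{3+|q|}(\omega^T y)\right] d\Lambda(\omega),$$

so that $\partial^{p,q} k(x, y)$ can be approximated by replacing $\Lambda$ with $\Lambda_m$, resulting in

$$\widehat{\partial^{p,q} k}(x, y) := s^{p,q}(x, y) = \frac{1}{m} \sum_{j=1}^m \omega_j^p (-\omega_j)^q h_{|p+q|}\left(\omega_j^T (x - y)\right) = \langle \phi_p(x), \phi_q(y) \rangle_{\mathbb{R}^{2m}}, \tag{6}$$

where $\phi_p(u) := \frac{1}{\sqrt{m}} \left(\omega_1^p h_{|p|}(\omega_1^T u), \ldots, \omega_m^p h_{|p|}(\omega_m^T u), \omega_1^p h_{3+|p|}(\omega_1^T u), \ldots, \omega_m^p h_{3+|p|}(\omega_m^T u)\right)$ and $(\omega_j)_{j=1}^m \overset{\text{i.i.d.}}{\sim} \Lambda$. Now the goal is to understand the behavior of $\|s^{p,q} - \partial^{p,q} k\|_{\mathcal{S} \times \mathcal{S}}$ and $\|s^{p,q} - \partial^{p,q} k\|_{L^r(\mathcal{S})}$ for $r \in [1, \infty)$, i.e., obtain analogues of Theorems 1 and 3.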
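The derivative feature map in (6) can be sketched in a few lines. The example below (an illustration, not the authors' implementation) builds $\phi_p$, checks that $\langle \phi_p(x), \phi_q(y) \rangle$ agrees with the direct sum in (6), and compares against the closed-form mixed derivative of the Gaussian kernel with unit bandwidth in $d = 2$. All function names, the multi-indices, the test points and the sample size are assumptions.

```python
import numpy as np

def h(a, t):
    """h_a(t) = cos(pi * a / 2 + t), so h_0 = cos, h_1 = -sin, h_2 = -cos, h_3 = sin."""
    return np.cos(np.pi * a / 2.0 + t)

def monomial(omegas, p):
    """omega^p = prod_j omega_j^{p_j}, evaluated for each row of omegas (m, d)."""
    return np.prod(omegas ** np.asarray(p), axis=1)

def derivative_features(u, omegas, p):
    """phi_p(u) in R^{2m} for a single point u of shape (d,), following (6)."""
    m = omegas.shape[0]
    proj = omegas @ u
    weights = monomial(omegas, p)
    order = int(np.sum(p))
    return np.concatenate([weights * h(order, proj),
                           weights * h(3 + order, proj)]) / np.sqrt(m)

def s_pq(x, y, omegas, p, q):
    """Direct evaluation of (1/m) sum_j omega_j^p (-omega_j)^q h_{|p+q|}(omega_j^T (x - y))."""
    order = int(np.sum(p) + np.sum(q))
    weights = monomial(omegas, p) * monomial(-omegas, q)
    return np.mean(weights * h(order, omegas @ (x - y)))

rng = np.random.default_rng(1)
m, d = 20000, 2
omegas = rng.normal(size=(m, d))        # assumption: Gaussian kernel, unit bandwidth
x, y = np.array([0.3, -0.2]), np.array([-0.4, 0.5])
p, q = (1, 0), (0, 1)                   # d/dx_1 d/dy_2

via_features = derivative_features(x, omegas, p) @ derivative_features(y, omegas, q)
via_sum = s_pq(x, y, omegas, p, q)
z = x - y
exact = -z[0] * z[1] * np.exp(-0.5 * z @ z)   # closed form for exp(-||x - y||^2 / 2)
print(via_features, via_sum, exact)
```

The agreement between `via_features` and `via_sum` is an exact trigonometric identity, while their distance to `exact` is the random approximation error whose behavior is studied in the remainder of this section.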


As in the proof sketch of Theorem 1, while $\|s^{p,q} - \partial^{p,q} k\|_{\mathcal{S} \times \mathcal{S}}$ can be analyzed as the supremum of an empirical process indexed by a suitable function class (say $\mathcal{G}$), some technical issues arise because $\mathcal{G}$ is not uniformly bounded. This means McDiarmid's or Talagrand's inequality cannot be applied to achieve concentration and bounding the Rademacher average by the Dudley entropy bound may not be reasonable. While these issues can be tackled by resorting to more technical and refined methods, in this paper, we generalize (see Theorem 4, which is proved in Section B.1 of the supplement) Theorem 1 to derivatives under the restrictive assumption that $\mathrm{supp}(\Lambda)$ is bounded (note that many popular kernels including the Gaussian do not satisfy this assumption). We also present another result (see Theorem 5) by generalizing the proof technique³ of [13] to unbounded functions, where the boundedness assumption of $\mathrm{supp}(\Lambda)$ is relaxed but at the expense of a worse rate (compared to Theorem 4).


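Theorem 4. Let $p, q \in \mathbb{N}^d$, $T_{p,q} := \sup_{\omega \in \mathrm{supp}(\Lambda)} |\omega^{p+q}|$, $C_{p,q} := \mathbb{E}_{\omega \sim \Lambda}\left[|\omega^{p+q}| \, \|\omega\|_2^2\right]$, and assume that $C_{2p,2q} < \infty$. Suppose $\mathrm{supp}(\Lambda)$ is bounded if $p \ne 0$ and $q \ne 0$. Then for any $\tau > 0$ and non-empty compact set $\mathcal{S} \subset \mathbb{R}^d$,

$$\Lambda^m\left(\left\{(\omega_i)_{i=1}^m : \|\partial^{p,q} k - s^{p,q}\|_{\mathcal{S} \times \mathcal{S}} \ge \frac{H(d, p, q, |\mathcal{S}|) + T_{p,q} \sqrt{2\tau}}{\sqrt{m}}\right\}\right) \le e^{-\tau},$$

where

$$H(d, p, q, |\mathcal{S}|) = 32\sqrt{2d \, T_{2p,2q}} \left[\sqrt{U(p, q, |\mathcal{S}|)} + \frac{1}{2\sqrt{U(p, q, |\mathcal{S}|)}} + \sqrt{\log\left(\sqrt{C_{2p,2q}} + 1\right)}\right],$$

$$U(p, q, |\mathcal{S}|) = \log\left(2 |\mathcal{S}| \, T_{2p,2q}^{-1/2} + 1\right).$$

Remark 3. (i) Note that Theorem 4 reduces to Theorem 1 if $p = q = 0$, in which case $T_{p,q} = T_{2p,2q} = 1$. If $p \ne 0$ or $q \ne 0$, then the boundedness of $\mathrm{supp}(\Lambda)$ implies that $T_{p,q} < \infty$ and $T_{2p,2q} < \infty$.

(ii) Growth of $|\mathcal{S}_m|$: By the same reasoning as in Remark 1(ii) and Corollary 2, it follows that $\|\partial^{p,q} k - s^{p,q}\|_{\mathcal{S}_m \times \mathcal{S}_m} \xrightarrow{a.s.} 0$ if $|\mathcal{S}_m| = e^{o(m)}$ and $\|\partial^{p,q} k - s^{p,q}\|_{L^r(\mathcal{S}_m)} \xrightarrow{a.s.} 0$ if $m^{-1/2} |\mathcal{S}_m|^{2d/r} \sqrt{\log |\mathcal{S}_m|} \to 0$ (for $1 \le r < \infty$) as $m \to \infty$. An exact analogue of Theorem 3 can be obtained (but with different constants) under the assumption that $\mathrm{supp}(\Lambda)$ is bounded and it can be shown that for $r \in [2, \infty)$, $\|\partial^{p,q} k - s^{p,q}\|_{L^r(\mathcal{S}_m)} \xrightarrow{a.s.} 0$ if $|\mathcal{S}_m| = o(m^{\frac{r}{4d}})$. ■

The following result relaxes the boundedness of $\mathrm{supp}(\Lambda)$ by imposing certain moment conditions on $\Lambda$ but at the expense of a worse rate. The proof relies on applying the Bernstein inequality at the elements of a net (which exists by the compactness of $\mathcal{S}$) combined with a union bound, and extending the approximation error from the anchors by a probabilistic Lipschitz argument.

Theorem 5. Let $p, q \in \mathbb{N}^d$, $\psi$ be continuously differentiable, $z \mapsto \nabla_z\left[\partial^{p,q} k(z)\right]$ be continuous, $\mathcal{S} \subset \mathbb{R}^d$ be any non-empty compact set, $D_{p,q,\mathcal{S}} := \sup_{z \in \mathrm{conv}(\mathcal{S}_\Delta)} \left\|\nabla_z\left[\partial^{p,q} k(z)\right]\right\|_2$ and $E_{p,q} := \mathbb{E}_{\omega \sim \Lambda}\left[|\omega^{p+q}| \, \|\omega\|_2\right]$. Assume that $E_{p,q} < \infty$. Suppose $\exists L > 0$, $\sigma > 0$ such that

$$\mathbb{E}_{\omega \sim \Lambda}\left[|f(z; \omega)|^M\right] \le \frac{M! \, \sigma^2 L^{M-2}}{2} \quad (\forall M \ge 2, \ \forall z \in \mathcal{S}_\Delta), \tag{7}$$

where $f(z; \omega) = \partial^{p,q} k(z) - \omega^p (-\omega)^q h_{|p+q|}(\omega^T z)$. Define $F_d := d^{-\frac{d}{d+1}} + d^{\frac{1}{d+1}}$. Then

$$\Lambda^m\left(\left\{(\omega_i)_{i=1}^m : \|\partial^{p,q} k - s^{p,q}\|_{\mathcal{S} \times \mathcal{S}} \ge \epsilon\right\}\right) \le 2^{d-1} e^{-\frac{m \epsilon^2}{8 \sigma^2 \left(1 + \frac{\epsilon L}{2 \sigma^2}\right)}} + F_d \, 2^{\frac{4d-1}{d+1}} \left[\frac{|\mathcal{S}| \left(D_{p,q,\mathcal{S}} + E_{p,q}\right)}{\epsilon}\right]^{\frac{d}{d+1}} e^{-\frac{m \epsilon^2}{8 (d+1) \sigma^2 \left(1 + \frac{\epsilon L}{2 \sigma^2}\right)}}. \tag{8}$$

³ We also correct some technical issues in the proof of [13, Claim 1], where (i) a shift-invariant argument was applied to the non-shift invariant kernel estimator $\hat{k}(x, y) = \frac{1}{m} \sum_{j=1}^m 2\cos(\omega_j^T x + b_j) \cos(\omega_j^T y + b_j) = \frac{1}{m} \sum_{j=1}^m \left[\cos(\omega_j^T (x - y)) + \cos(\omega_j^T (x + y) + 2 b_j)\right]$, (ii) the convexity of $\mathcal{S}$ was not imposed, leading to a possibly undefined Lipschitz constant ($L$) and (iii) the randomness of $\Delta^* = \arg\max_{\Delta \in \mathcal{S}_\Delta} \|\nabla[k(\Delta) - \hat{k}(\Delta)]\|_2$ was not taken into account, thus the upper bound on the expectation of the squared Lipschitz constant ($\mathbb{E}[L^2]$) does not hold.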


Remark 4. (i) The compactness of $\mathcal{S}$ implies that of $\mathcal{S}_\Delta$. Hence, by the continuity of $z \mapsto \nabla_z\left[\partial^{p,q} k(z)\right]$, one gets $D_{p,q,\mathcal{S}} < \infty$. (7) holds if $|f(z; \omega)| \le \frac{L}{2}$ and $\mathbb{E}_{\omega \sim \Lambda}\left[|f(z; \omega)|^2\right] \le \sigma^2$ ($\forall z \in \mathcal{S}_\Delta$). If $\mathrm{supp}(\Lambda)$ is bounded, then the boundedness of $f$ is guaranteed (see Section B.4 in the supplement).

(ii) In the special case when $p = q = 0$, our requirement boils down to the continuous differentiability of $\psi$, $E_{0,0} = \mathbb{E}_{\omega \sim \Lambda} \|\omega\|_2 < \infty$ and (7).

(iii) Note that (8) is similar to (3) and therefore based on the discussion in Section 2, one has $\|\partial^{p,q} k - s^{p,q}\|_{\mathcal{S} \times \mathcal{S}} = O_{a.s.}\left(|\mathcal{S}| \sqrt{m^{-1} \log m}\right)$. But the advantage of Theorem 5 over [13, Claim 1] and [22, Prop. 1] is that it can handle unbounded functions. In comparison to Theorem 4, we obtain worse rates and it will be of interest to improve the rates of Theorem 5 while handling unbounded functions. ■


5 Discussion 

In this paper, we presented the first detailed theoretical analysis of the approximation quality of random Fourier features (RFF), which were proposed by [13] in the context of improving the computational complexity of kernel machines. While [13, 22] provided a probabilistic bound on the uniform approximation (over compact subsets of $\mathbb{R}^d$) of a kernel by random features, the result is not optimal. We improved this result by providing a finite-sample bound with optimal rate of convergence and also analyzed the quality of approximation in $L^r$-norm ($1 \le r < \infty$). We also proposed an RFF approximation for derivatives of a kernel and provided theoretical guarantees on the quality of approximation in uniform and $L^r$-norms over compact subsets of $\mathbb{R}^d$.

While all the results in this paper (and also in the literature) dealt with the approximation quality of RFF over only compact subsets of $\mathbb{R}^d$, it is of interest to understand its behavior over the entire $\mathbb{R}^d$. However, as discussed in Remark 1(ii) and in the paragraph following Theorem 3, RFF cannot approximate the kernel uniformly or in $L^r$-norm over $\mathbb{R}^d$. By truncating the Taylor series expansion of the exponential function, [3] proposed a non-random finite dimensional representation to approximate the Gaussian kernel which also enjoys the computational advantages of RFF. However, this representation also does not approximate the Gaussian kernel uniformly over $\mathbb{R}^d$. Therefore, the question remains whether it is possible to approximate a kernel uniformly or in $L^r$-norm over $\mathbb{R}^d$ while still retaining the computational advantages associated with RFF.
Supplement


A Definitions & notation 


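Let $(\mathcal{Z},\rho)$ be a metric space, $(\Omega,\mathcal{A})$ a measurable space, and let $L^0(\Omega,\mathcal{A})$ denote the set of $(\Omega,\mathcal{A}) \to \mathbb{R}$ measurable functions.
A family of maps $\mathcal{G} = \{g_z\}_{z\in\mathcal{Z}} \subseteq L^0(\Omega,\mathcal{A})$ is called a separable Carathéodory family w.r.t. $\mathcal{Z}$ if $(\mathcal{Z},\rho)$ is separable and
$z \mapsto g_z(\omega)$ is continuous for all $\omega\in\Omega$. Let $\mathcal{G}\subseteq L^0(\Omega,\mathcal{A})$, let $\varepsilon = (\varepsilon_1,\ldots,\varepsilon_m)$ be a Rademacher sequence, i.e., the $\varepsilon_j$-s are
i.i.d. with $P(\varepsilon_j = 1) = P(\varepsilon_j = -1) = \frac{1}{2}$, and let $(\omega_j)_{j=1}^m \subseteq \Omega$. The Rademacher average of $\mathcal{G}$ is defined as

$$\mathcal{R}(\mathcal{G},\omega_{1:m}) := E_\varepsilon \sup_{g\in\mathcal{G}} \left| \frac{1}{m}\sum_{j=1}^m \varepsilon_j g(\omega_j) \right|;$$

we use the shorthand $\omega_{1:m} = (\omega_1,\ldots,\omega_m)$. $S \subseteq \mathcal{Z}$ is said to be an $r$-net of $\mathcal{Z}$ if for any
$z\in\mathcal{Z}$ there is an $s\in S$ such that $\rho(s,z)\le r$. The $r$-covering number of $\mathcal{Z}$ is defined as the size of the smallest $r$-net, i.e.,
$\mathcal{N}(\mathcal{Z},\rho,r) = \inf\{\ell\ge 1 : \exists\, s_1,\ldots,s_\ell \text{ such that } \mathcal{Z}\subseteq \cup_{j=1}^{\ell} B_\rho(s_j,r)\}$, where $B_\rho(s,r) = \{z\in\mathcal{Z} : \rho(z,s)\le r\}$ is the closed
ball with center $s\in\mathcal{Z}$ and radius $r$. $\log\mathcal{N}(\mathcal{Z},\rho,r)$ is called the metric entropy. A $(\mathcal{Z},\|\cdot\|)$ Banach space is said to be of type
$q\in(1,2]$ if there exists a constant $C\in\mathbb{R}$ such that

$$E_\varepsilon \left\| \sum_{j=1}^n \varepsilon_j f_j \right\| \le C \left( \sum_{j=1}^n \|f_j\|^q \right)^{1/q}$$

holds for every finite set of vectors $\{f_j\}_{j=1}^n \subseteq \mathcal{Z}$. For example, $L^r(\Omega,\mathcal{A},\mu)$ spaces are of type $q = \min(2,r)$ [6, page 73], where the constant $C$ only
depends on $r$ ($C = C_r$). For a $(\mathcal{Z},\|\cdot\|)$ normed space, $\mathcal{Z}^*$ denotes the space of continuous linear functionals on $\mathcal{Z}$.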
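
To make the definition of the Rademacher average concrete, the following sketch estimates it by Monte Carlo for a small finite function class. The class $g_z(\omega) = \cos(\omega z)$ on a grid of $z$ values, the Gaussian sampling of $\omega_j$, and all function names are illustrative assumptions, not part of the paper.

```python
import math
import random


def rademacher_average(function_values, n_draws=500, seed=0):
    """Monte Carlo estimate of E_eps sup_g |(1/m) sum_j eps_j g(omega_j)|.

    function_values[i][j] holds g_i(omega_j) for a finite class {g_i}.
    """
    rng = random.Random(seed)
    m = len(function_values[0])
    total = 0.0
    for _ in range(n_draws):
        eps = [rng.choice((-1.0, 1.0)) for _ in range(m)]  # Rademacher signs
        total += max(
            abs(sum(e * v for e, v in zip(eps, row))) / m for row in function_values
        )
    return total / n_draws


def cosine_class_values(z_grid, omegas):
    # Hypothetical finite class g_z(omega) = cos(omega * z), z on a grid.
    return [[math.cos(w * z) for w in omegas] for z in z_grid]


sample_rng = random.Random(1)
z_grid = [i / 5 for i in range(6)]  # z in {0, 0.2, ..., 1}
omegas_small = [sample_rng.gauss(0.0, 1.0) for _ in range(20)]
omegas_large = [sample_rng.gauss(0.0, 1.0) for _ in range(320)]
avg_small = rademacher_average(cosine_class_values(z_grid, omegas_small))
avg_large = rademacher_average(cosine_class_values(z_grid, omegas_large))
print(avg_small, avg_large)
```

The estimate shrinks as $m$ grows, which is the behavior the entropy-based bounds in Section B.1 quantify.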


B Proofs 

We provide proofs of the results presented in the main text. Lemmas used in the proofs are listed in Section C.

B.1 Proof of Theorems 1 and 4

Below we prove Theorem 4, thereby Theorem 1 ($p = q = 0$). The idea of the proof is as follows: (i) We note that

$$\|\partial^{p,q}k - s^{p,q}\|_{\mathcal{S}\times\mathcal{S}} = \sup_{x,y\in\mathcal{S}} |\partial^{p,q}k(x,y) - s^{p,q}(x,y)| = \sup_{g\in\mathcal{G}} |\Lambda g - \Lambda_m g| =: \|\Lambda - \Lambda_m\|_{\mathcal{G}}, \qquad \text{(B.1)}$$

where $\mathcal{G} := \{g_z : z\in\mathcal{S}_\Delta\}$ and $g_z : \mathrm{supp}(\Lambda)\to\mathbb{R}$, $\omega\mapsto\omega^p(-\omega)^q h_{|p+q|}(\omega^T z)$, which means the object of interest is
the supremum of an empirical process indexed by $\mathcal{G}$. (ii) We show that $\|\Lambda-\Lambda_m\|_{\mathcal{G}}$ is measurable w.r.t. $\Lambda^m$ by verifying that $\mathcal{G}$
is a separable Carathéodory family (see the discussion following Definition 7.4 in [9]). (iii) (B.1) can be shown to satisfy the
bounded difference property in (C.1) and therefore by McDiarmid's inequality (Lemma C.1), $\|\Lambda-\Lambda_m\|_{\mathcal{G}}$ concentrates around its
expectation. (iv) By applying the symmetrization lemma [9, Proposition 7.10] for the uniformly bounded function family $\mathcal{G}$, we
obtain an upper bound in terms of the expected Rademacher average of $\mathcal{G}$. (v) The Rademacher average is bounded by the metric
entropy of $\mathcal{G}$ (making use of Dudley's entropy integral [2, Equation 4.4]), for which we can get an estimate by showing that
$\mathcal{G}$ is a smoothly parametrized function class using the compactness of $\mathcal{S}_\Delta$.
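
As a numerical illustration of the quantity in (B.1) for $p = q = 0$, the sketch below estimates the uniform error of a random Fourier feature approximation on a grid. It assumes a one-dimensional Gaussian kernel $k(z) = e^{-z^2/2}$ with spectral measure $\Lambda = N(0,1)$ and $\mathcal{S} = [-1,1]$ (so $\mathcal{S}_\Delta = [-2,2]$); the grid, the sample sizes, and the function names are hypothetical choices made only for this example.

```python
import math
import random


def rff_sup_error(m, z_grid, rng):
    """Grid approximation of sup_z |k(z) - k_hat(z)| for the 1-D Gaussian kernel."""
    omegas = [rng.gauss(0.0, 1.0) for _ in range(m)]  # omega_j ~ Lambda = N(0, 1)
    worst = 0.0
    for z in z_grid:
        exact = math.exp(-0.5 * z * z)                      # Lambda g_z
        approx = sum(math.cos(w * z) for w in omegas) / m   # Lambda_m g_z
        worst = max(worst, abs(exact - approx))
    return worst


def mean_sup_error(m, n_rep=20, seed=0):
    """Average the grid sup-error over independent draws of the features."""
    rng = random.Random(seed)
    z_grid = [-2.0 + 4.0 * i / 40 for i in range(41)]  # grid on S_Delta = [-2, 2]
    return sum(rff_sup_error(m, z_grid, rng) for _ in range(n_rep)) / n_rep


err_small_m = mean_sup_error(50)
err_large_m = mean_sup_error(800)
print(err_small_m, err_large_m)
```

Increasing $m$ by a factor of 16 should reduce the averaged error by roughly a factor of 4 if the $m^{-1/2}$ rate holds; the grid only approximates the supremum, so this is a sanity check rather than a verification of the theorem.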

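• $\mathcal{G}$ is a separable Carathéodory family: $\mathcal{G}$ is a separable Carathéodory family w.r.t. $\mathcal{S}_\Delta$ since

1. $g_z : \mathrm{supp}(\Lambda)\to\mathbb{R}$, $\omega\mapsto\omega^p(-\omega)^q h_{|p+q|}(\omega^T z)$ is measurable for all $z\in\mathcal{S}_\Delta$.

2. $\mathcal{S}_\Delta\subseteq\mathbb{R}^d$ is separable since $\mathbb{R}^d$ is separable.

3. $z\mapsto\omega^p(-\omega)^q h_{|p+q|}(\omega^T z)$ is continuous for all $\omega\in\mathrm{supp}(\Lambda)$.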

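• Concentration of $\|\Lambda-\Lambda_m\|_{\mathcal{G}}$ by its bounded difference property: By defining $f(\omega_1,\ldots,\omega_m) := \|\Lambda-\Lambda_m\|_{\mathcal{G}}$, we
have that for all $i\in\{1,\ldots,m\}$,

$$\begin{aligned}
&|f(\omega_1,\ldots,\omega_{i-1},\omega_i,\omega_{i+1},\ldots,\omega_m) - f(\omega_1,\ldots,\omega_{i-1},\omega_i',\omega_{i+1},\ldots,\omega_m)| \\
&\quad = \left| \sup_{g\in\mathcal{G}} \left| \Lambda g - \frac{1}{m}\sum_{j=1}^m g(\omega_j) \right| - \sup_{g\in\mathcal{G}} \left| \Lambda g - \frac{1}{m}\sum_{j=1}^m g(\omega_j) + \frac{1}{m} g(\omega_i) - \frac{1}{m} g(\omega_i') \right| \right| \\
&\quad \le \frac{1}{m}\sup_{g\in\mathcal{G}} |g(\omega_i) - g(\omega_i')| \le \frac{1}{m}\sup_{g\in\mathcal{G}} \left( |g(\omega_i)| + |g(\omega_i')| \right) \le \frac{1}{m}\left[ \sup_{g\in\mathcal{G}} |g(\omega_i)| + \sup_{g\in\mathcal{G}} |g(\omega_i')| \right] \\
&\quad \le \frac{1}{m}\left[ |\omega_i^{p+q}| + |(\omega_i')^{p+q}| \right] \le \frac{2T_{p,q}}{m}.
\end{aligned}$$

Applying McDiarmid's inequality (Lemma C.1) to $f$, for any $\tau > 0$, with probability at least $1 - e^{-\tau}$ over the choice of
$(\omega_i)_{i=1}^m \overset{\text{i.i.d.}}{\sim} \Lambda$,

$$\|\Lambda-\Lambda_m\|_{\mathcal{G}} \le E_{\omega_{1:m}}\|\Lambda-\Lambda_m\|_{\mathcal{G}} + T_{p,q}\sqrt{\frac{2\tau}{m}}. \qquad \text{(B.2)}$$

• Bounding $E_{\omega_{1:m}}\|\Lambda-\Lambda_m\|_{\mathcal{G}}$: By the symmetrization lemma [9, Proposition 7.10] applied for the uniformly bounded
function family $\mathcal{G}$ ($\sup_{g\in\mathcal{G}}\|g\|_\infty \le T_{p,q} < \infty$), we have

$$E_{\omega_{1:m}}\|\Lambda-\Lambda_m\|_{\mathcal{G}} \le 2E_{\omega_{1:m}}\mathcal{R}(\mathcal{G},\omega_{1:m}). \qquad \text{(B.3)}$$

• Bounding $\mathcal{R}(\mathcal{G},\omega_{1:m})$: Using Dudley's entropy integral [2, Equation 4.4], we have

$$\mathcal{R}(\mathcal{G},\omega_{1:m}) \le \frac{8\sqrt{2}}{\sqrt{m}} \int_0^{|\mathcal{G}|_{L^2(\Lambda_m)}} \sqrt{\log\mathcal{N}(\mathcal{G},L^2(\Lambda_m),r)}\,dr. \qquad \text{(B.4)}$$

The upper limit of the integral can be bounded as

$$|\mathcal{G}|_{L^2(\Lambda_m)} = \sup_{g_1,g_2\in\mathcal{G}} \|g_1-g_2\|_{L^2(\Lambda_m)} \le \sup_{g_1,g_2\in\mathcal{G}} \left( \|g_1\|_{L^2(\Lambda_m)} + \|g_2\|_{L^2(\Lambda_m)} \right) \le 2\sup_{g\in\mathcal{G}}\|g\|_{L^2(\Lambda_m)} \overset{(*)}{\le} 2\sqrt{T_{2p,2q}}, \qquad \text{(B.5)}$$

where $(*)$ follows from

$$\sup_{g\in\mathcal{G}}\|g\|_{L^2(\Lambda_m)} = \sup_{z\in\mathcal{S}_\Delta} \sqrt{\frac{1}{m}\sum_{j=1}^m g_z^2(\omega_j)} = \sup_{z\in\mathcal{S}_\Delta} \sqrt{\frac{1}{m}\sum_{j=1}^m \left[ \omega_j^p(-\omega_j)^q h_{|p+q|}(\omega_j^T z) \right]^2} \le \sqrt{\frac{1}{m}\sum_{j=1}^m |\omega_j^{2(p+q)}|} \le \sqrt{T_{2p,2q}}.$$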

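• Bounding $\mathcal{N}(\mathcal{G},L^2(\Lambda_m),r)$ by the compactness of $\mathcal{S}_\Delta$: For any $g_{z_1}, g_{z_2}\in\mathcal{G}$,

$$\|g_{z_1}-g_{z_2}\|_{L^2(\Lambda_m)} = \left\| \omega^p(-\omega)^q \left( h_{|p+q|}(\omega^T z_1) - h_{|p+q|}(\omega^T z_2) \right) \right\|_{L^2(\Lambda_m)}.$$

By the mean value theorem, there exists $c\in(0,1)$ such that

$$\left| h_{|p+q|}(\omega^T z_1) - h_{|p+q|}(\omega^T z_2) \right| \le \left\| \nabla_z h_{|p+q|}\left(\omega^T(cz_1+(1-c)z_2)\right) \right\|_2 \|z_1-z_2\|_2,$$

where

$$\left\| \nabla_z h_{|p+q|}\left(\omega^T(cz_1+(1-c)z_2)\right) \right\|_2 \le \|\omega\|_2.$$

Therefore,

$$\|g_{z_1}-g_{z_2}\|_{L^2(\Lambda_m)} \le \sqrt{\frac{1}{m}\sum_{j=1}^m |\omega_j^{p+q}|^2 \|\omega_j\|_2^2 \|z_1-z_2\|_2^2} = \|z_1-z_2\|_2 \sqrt{\frac{1}{m}\sum_{j=1}^m |\omega_j^{2(p+q)}| \|\omega_j\|_2^2}. \qquad \text{(B.6)}$$

(B.6) shows that the existence of an $\epsilon$-net on $(\mathcal{S}_\Delta,\|\cdot\|_2)$ implies an $r = \epsilon\sqrt{\frac{1}{m}\sum_{j=1}^m |\omega_j^{2(p+q)}| \|\omega_j\|_2^2}$-net on $(\mathcal{G},L^2(\Lambda_m))$.
In other words,

$$\mathcal{N}(\mathcal{G},L^2(\Lambda_m),r) \le \mathcal{N}\left( \mathcal{S}_\Delta, \|\cdot\|_2, \frac{r}{\sqrt{\frac{1}{m}\sum_{j=1}^m |\omega_j^{2(p+q)}| \|\omega_j\|_2^2}} \right).$$

Define

$$A_{p,q} := \sqrt{\frac{1}{m}\sum_{j=1}^m |\omega_j^{2(p+q)}| \|\omega_j\|_2^2}.$$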
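
The Lipschitz-type estimate (B.6) can be checked numerically in a simple special case. The sketch below assumes $d = 1$, $p = q = 0$ and $h_0 = \cos$ (consistent with the cosine features used in Section B.2), so that $g_z(\omega) = \cos(\omega z)$ and $A_{0,0} = \sqrt{\frac{1}{m}\sum_j \omega_j^2}$; the sampling distribution and the function names are illustrative assumptions.

```python
import math
import random


def empirical_l2_distance(z1, z2, omegas):
    """|| g_{z1} - g_{z2} ||_{L^2(Lambda_m)} for g_z(omega) = cos(omega * z)."""
    m = len(omegas)
    return math.sqrt(sum((math.cos(w * z1) - math.cos(w * z2)) ** 2 for w in omegas) / m)


def lipschitz_constant(omegas):
    """A_{0,0} = sqrt((1/m) sum_j omega_j^2) in the 1-D case p = q = 0."""
    return math.sqrt(sum(w * w for w in omegas) / len(omegas))


rng = random.Random(0)
omegas = [rng.gauss(0.0, 1.0) for _ in range(200)]
pairs = [(rng.uniform(-2, 2), rng.uniform(-2, 2)) for _ in range(100)]
a_00 = lipschitz_constant(omegas)
# Collect every pair that would violate ||g_{z1} - g_{z2}|| <= |z1 - z2| * A_{0,0}.
violations = [
    (z1, z2)
    for z1, z2 in pairs
    if empirical_l2_distance(z1, z2, omegas) > abs(z1 - z2) * a_00 + 1e-12
]
print(len(violations))
```

No violations are expected here because $|\cos a - \cos b| \le |a - b|$ holds pointwise, which is the one-dimensional form of the mean value argument above.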


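By using the fact that $\mathcal{S}_\Delta \subseteq B_{\|\cdot\|_2}\left(t, \frac{|\mathcal{S}_\Delta|}{2}\right)$ for some $t\in\mathbb{R}^d$ and $\mathcal{N}(B_{\|\cdot\|_2}(s,R),\|\cdot\|_2,\epsilon) \le \left(\frac{4R}{\epsilon}+1\right)^d$ for any $s\in\mathbb{R}^d$
[10, Lemma 2.5, page 20], we obtain

$$\mathcal{N}(\mathcal{G},L^2(\Lambda_m),r) \le \left( \frac{4|\mathcal{S}|A_{p,q}}{r} + 1 \right)^d, \qquad \text{(B.7)}$$

by noting that $|\mathcal{S}_\Delta| \le 2|\mathcal{S}|$. Using (B.5) and (B.7) in (B.4), we have

$$\mathcal{R}(\mathcal{G},\omega_{1:m}) \le \frac{8\sqrt{2d}}{\sqrt{m}} \int_0^{2\sqrt{T_{2p,2q}}} \sqrt{\log\left( \frac{4|\mathcal{S}|A_{p,q}}{r} + 1 \right)}\,dr \le \frac{8\sqrt{2d}}{\sqrt{m}} \int_0^{2\sqrt{T_{2p,2q}}} \sqrt{\log\left( \frac{4|\mathcal{S}|A_{p,q} + 2\sqrt{T_{2p,2q}}}{r} \right)}\,dr, \qquad \text{(B.8)}$$

where in the last inequality we used the fact that $r \le 2\sqrt{T_{2p,2q}}$. By bounding $2|\mathcal{S}|A_{p,q} + \sqrt{T_{2p,2q}} \le (2|\mathcal{S}| + \sqrt{T_{2p,2q}})(A_{p,q}+1)$, (B.8) reduces to

$$\begin{aligned}
\mathcal{R}(\mathcal{G},\omega_{1:m}) &\le \frac{8\sqrt{2d}}{\sqrt{m}} \left[ \int_0^{2\sqrt{T_{2p,2q}}} \sqrt{\log\frac{2\left(2|\mathcal{S}| + \sqrt{T_{2p,2q}}\right)}{r}}\,dr + 2\sqrt{T_{2p,2q}}\sqrt{\log(A_{p,q}+1)} \right] \\
&= \frac{16\sqrt{2d}}{\sqrt{m}} \sqrt{T_{2p,2q}} \left( \int_0^1 \sqrt{\log\frac{B_{p,q}+1}{r}}\,dr + \sqrt{\log(A_{p,q}+1)} \right), \qquad \text{(B.9)}
\end{aligned}$$

where the last equality is obtained by changing the variable of integration and defining $B_{p,q} := \frac{2|\mathcal{S}|}{\sqrt{T_{2p,2q}}}$. By applying
Lemma C.2 to bound the integral in (B.9), we obtain

$$\mathcal{R}(\mathcal{G},\omega_{1:m}) \le \frac{16\sqrt{2d}}{\sqrt{m}} \sqrt{T_{2p,2q}} \left( \sqrt{\log(B_{p,q}+1)} + \frac{1}{2\sqrt{\log(B_{p,q}+1)}} + \sqrt{\log(A_{p,q}+1)} \right). \qquad \text{(B.10)}$$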
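
The covering-number bound $\left(\frac{4R}{\epsilon}+1\right)^d$ behind (B.7) can be illustrated in the simplest case $d = 1$, where a ball of radius $R$ is an interval. The construction below (equally spaced centers) is a hypothetical choice for illustration and is not taken from [10].

```python
import math


def interval_net(center, radius, eps):
    """Centers of closed eps-balls covering [center - radius, center + radius]."""
    n = max(1, math.ceil(radius / eps))   # number of balls
    half_width = radius / n               # at most eps by the choice of n
    left = center - radius
    return [left + (2 * i + 1) * half_width for i in range(n)]


def is_covered(point, centers, eps, tol=1e-12):
    """True if the point lies in some closed eps-ball around a center."""
    return any(abs(point - c) <= eps + tol for c in centers)


radius, eps = 3.0, 0.4
centers = interval_net(0.0, radius, eps)
grid = [-radius + 2 * radius * i / 1000 for i in range(1001)]
all_covered = all(is_covered(x, centers, eps) for x in grid)
size_bound = 4 * radius / eps + 1   # (4R/eps + 1)^d with d = 1
print(len(centers), size_bound, all_covered)
```

In one dimension the explicit net is much smaller than the general bound, which is expected since the bound is stated for arbitrary $d$.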


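• Bounding the expectation of the Rademacher average: From (B.10), we have

$$E_{\omega_{1:m}}\mathcal{R}(\mathcal{G},\omega_{1:m}) \le \frac{16\sqrt{2d}}{\sqrt{m}} \sqrt{T_{2p,2q}} \left( \sqrt{\log(B_{p,q}+1)} + \frac{1}{2\sqrt{\log(B_{p,q}+1)}} + \sqrt{\log\left(\sqrt{C_{2p,2q}}+1\right)} \right), \qquad \text{(B.11)}$$

which is obtained by repeated applications of Jensen's inequality to bound

$$E_{\omega_{1:m}}\sqrt{\log(A_{p,q}+1)} \le \sqrt{E_{\omega_{1:m}}\log(A_{p,q}+1)} \le \sqrt{\log\left(E_{\omega_{1:m}}A_{p,q}+1\right)}, \quad \text{where} \quad E_{\omega_{1:m}}A_{p,q} \le \sqrt{\frac{1}{m}\sum_{j=1}^m E_{\omega_j\sim\Lambda}\left[ |\omega_j^{2(p+q)}| \|\omega_j\|_2^2 \right]} = \sqrt{C_{2p,2q}}.$$

• Final bound: Combining (B.2), (B.3) and (B.11) yields the result. □

B.2 Proof of Theorem 3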


Below we prove Theorem 3. (i) We show that $f(\omega_1,\ldots,\omega_m) := \|k-\hat{k}\|_{L^r(\mathcal{S})}$ satisfies the bounded difference property, hence
by McDiarmid's inequality (Lemma C.1) it concentrates around its expectation $E\|k-\hat{k}\|_{L^r(\mathcal{S})}$. (ii) By $L^r(\mathcal{S}) = [L^{\tilde{r}}(\mathcal{S})]^*$
($\frac{1}{r}+\frac{1}{\tilde{r}} = 1$), the separability of $L^{\tilde{r}}(\mathcal{S})$ and the symmetrization lemma [11, Lemma 2.3.1], the value of $E\|k-\hat{k}\|_{L^r(\mathcal{S})}$ is
upper bounded in terms of $E_\varepsilon\left\|\sum_{i=1}^m \varepsilon_i\cos(\langle\omega_i,\cdot-\cdot\rangle)\right\|_{L^r(\mathcal{S})}$. (iii) Exploiting that $L^r(\mathcal{S})$ is of type $\min(r,2)$ with a constant
independent of $\mathcal{S}$, we get the result.

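• Concentration of $\|k-\hat{k}\|_{L^r(\mathcal{S})}$ by its bounded difference property: Define $\hat{k}_i(x,y) = \frac{1}{m}\sum_{j\ne i}\cos(\omega_j^T(x-y)) + \frac{1}{m}\cos((\omega_i')^T(x-y))$, where $\omega_i'$ is an i.i.d. copy of $\omega_i$. Then $\|k-\hat{k}\|_{L^r(\mathcal{S})}$ satisfies the bounded difference property in (C.1):

$$\sup_{(\omega_i)_{i=1}^m,\,\omega_i'} \left| \|k-\hat{k}\|_{L^r(\mathcal{S})} - \|k-\hat{k}_i\|_{L^r(\mathcal{S})} \right| \le \sup_{(\omega_i)_{i=1}^m,\,\omega_i'} \|\hat{k}_i-\hat{k}\|_{L^r(\mathcal{S})} \le \frac{2}{m}\sup_{\omega_i}\|\cos(\langle\omega_i,\cdot-\cdot\rangle)\|_{L^r(\mathcal{S})} \le \frac{2}{m}\mathrm{vol}^{2/r}(\mathcal{S}),$$

and therefore by McDiarmid's inequality (Lemma C.1), for any $\tau > 0$, with probability at least $1-e^{-\tau}$ over the choice of
$(\omega_i)_{i=1}^m \overset{\text{i.i.d.}}{\sim} \Lambda$, we have

$$\|k-\hat{k}\|_{L^r(\mathcal{S})} \le E_{\omega_{1:m}}\|k-\hat{k}\|_{L^r(\mathcal{S})} + \mathrm{vol}^{2/r}(\mathcal{S})\sqrt{\frac{2\tau}{m}}. \qquad \text{(B.12)}$$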


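• Symmetrization, reduction to $E_\varepsilon\left\|\sum_{i=1}^m \varepsilon_i\cos(\langle\omega_i,\cdot-\cdot\rangle)\right\|_{L^r(\mathcal{S})}$: Let $\tilde{r}$ be the dual exponent of $r$, in other words $\frac{1}{r}+\frac{1}{\tilde{r}} = 1$.
Then, by $L^r(\mathcal{S}) = [L^{\tilde{r}}(\mathcal{S})]^*$ and the separability of $L^{\tilde{r}}(\mathcal{S})$, there exists (see Lemma C.4) a countable $\mathcal{G}\subseteq L^{\tilde{r}}(\mathcal{S})$ ($\forall g\in\mathcal{G}$,
$\|g\|_{L^{\tilde{r}}(\mathcal{S})} = 1$) such that

$$\|k-\hat{k}\|_{L^r(\mathcal{S})} = \sup_{g\in\mathcal{G}} \left| \int_{\mathcal{S}\times\mathcal{S}} g(x,y)\left[ k(x,y)-\hat{k}(x,y) \right] dx\,dy \right|. \qquad \text{(B.13)}$$

One can rewrite the argument of this supremum by using the integral representations of $k$ and $\hat{k}$ w.r.t. $\Lambda$ and $\Lambda_m$:

$$\begin{aligned}
\int_{\mathcal{S}\times\mathcal{S}} g(x,y)\left[ k(x,y)-\hat{k}(x,y) \right] dx\,dy &= \int_{\mathcal{S}\times\mathcal{S}} g(x,y)\left[ \int_{\mathbb{R}^d} \cos(\omega^T(x-y))\,d(\Lambda-\Lambda_m)(\omega) \right] dx\,dy \\
&= \int_{\mathbb{R}^d} \left[ \int_{\mathcal{S}\times\mathcal{S}} g(x,y)\cos(\omega^T(x-y))\,dx\,dy \right] d(\Lambda-\Lambda_m)(\omega),
\end{aligned}$$

and thus

$$\|k-\hat{k}\|_{L^r(\mathcal{S})} = \sup_{\tilde{g}\in\tilde{\mathcal{G}}} |(\Lambda-\Lambda_m)\tilde{g}|, \qquad \text{(B.14)}$$

where $\tilde{\mathcal{G}} := \{\tilde{g}_g : g\in\mathcal{G}\}$, $\tilde{g}_g(\omega) = \int_{\mathcal{S}\times\mathcal{S}} g(x,y)\cos(\omega^T(x-y))\,dx\,dy$ and $\tilde{g}_g$ is continuous. Hence, using (B.14) with
the symmetrization lemma [11, Lemma 2.3.1] and (B.13), we have

$$\begin{aligned}
E_{\omega_{1:m}}\|k-\hat{k}\|_{L^r(\mathcal{S})} &\le 2E_{\omega_{1:m}}E_\varepsilon \sup_{\tilde{g}\in\tilde{\mathcal{G}}} \left| \frac{1}{m}\sum_{i=1}^m \varepsilon_i\tilde{g}(\omega_i) \right| \\
&= \frac{2}{m}E_{\omega_{1:m}}E_\varepsilon \sup_{g\in\mathcal{G}} \left| \sum_{i=1}^m \varepsilon_i \int_{\mathcal{S}\times\mathcal{S}} g(x,y)\cos(\omega_i^T(x-y))\,dx\,dy \right| \\
&= \frac{2}{m}E_{\omega_{1:m}}E_\varepsilon \sup_{g\in\mathcal{G}} \left| \int_{\mathcal{S}\times\mathcal{S}} g(x,y)\sum_{i=1}^m \varepsilon_i\cos(\omega_i^T(x-y))\,dx\,dy \right| \\
&= \frac{2}{m}E_{\omega_{1:m}}E_\varepsilon \left\| \sum_{i=1}^m \varepsilon_i\cos(\langle\omega_i,\cdot-\cdot\rangle) \right\|_{L^r(\mathcal{S})}, \qquad \text{(B.15)}
\end{aligned}$$

where $(\varepsilon_i)_{i=1}^m$ is a Rademacher sequence and $E_\varepsilon$ is the conditional expectation w.r.t. $(\varepsilon_i)_{i=1}^m$ with $(\omega_i)_{i=1}^m$ being the
conditioning random variables. Notice that the measurability of the $\tilde{g}_g$-s together with the countable cardinality of $\tilde{\mathcal{G}}$ enabled us to
write expectations instead of outer expectations in [11, Lemma 2.3.1, page 108-110], and hence in Eq. (B.15).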


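• Bounding $E_\varepsilon\left\|\sum_{i=1}^m \varepsilon_i\cos(\langle\omega_i,\cdot-\cdot\rangle)\right\|_{L^r(\mathcal{S})}$ by the type of $L^r(\mathcal{S})$:

$$E_\varepsilon\left\| \sum_{i=1}^m \varepsilon_i\cos(\langle\omega_i,\cdot-\cdot\rangle) \right\|_{L^r(\mathcal{S})} \overset{(*)}{\le} C_r' \left( \sum_{i=1}^m \|\cos(\langle\omega_i,\cdot-\cdot\rangle)\|_{L^r(\mathcal{S})}^{\min\{2,r\}} \right)^{\frac{1}{\min\{2,r\}}} \le C_r'\,\mathrm{vol}^{2/r}(\mathcal{S})\,m^{\max\left\{\frac{1}{2},\frac{1}{r}\right\}}, \qquad \text{(B.16)}$$

since $L^r(\mathcal{S})$ is of type $\min(2,r)$ [6, page 73] and there exists a universal constant $C_r'$ independent of $\mathcal{S}$ (the so-called
Khintchine constant) [5, page 247] such that $(*)$ holds; in addition we used

$$\sum_{i=1}^m \|\cos(\langle\omega_i,\cdot-\cdot\rangle)\|_{L^r(\mathcal{S})}^{\min\{2,r\}} = \sum_{i=1}^m \left( \int_{\mathcal{S}\times\mathcal{S}} |\cos(\omega_i^T(x-y))|^r\,dx\,dy \right)^{\frac{\min\{2,r\}}{r}} \le m\left[\mathrm{vol}^2(\mathcal{S})\right]^{\frac{\min\{2,r\}}{r}}$$

and $\frac{1}{\min\{2,r\}} = \max\left\{\frac{1}{2},\frac{1}{r}\right\}$.

Combining (B.12)-(B.16) and using the bound on $\mathrm{vol}(\mathcal{S})$ given in the proof of Corollary 2 yields the result.

□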
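
The $L^r$ error studied in this proof can be approximated numerically in a toy setting. The sketch below assumes $d = 1$, $\mathcal{S} = [0,1]$ (so $\mathrm{vol}(\mathcal{S}) = 1$), a Gaussian kernel with $\Lambda = N(0,1)$, and a midpoint quadrature rule; these choices and the function names are assumptions made for illustration only. It also evaluates the exponent $1-\max\{\frac{1}{2},\frac{1}{r}\}$ that results from combining the factor $\frac{2}{m}$ in (B.15) with $m^{\max\{1/2,1/r\}}$ in (B.16).

```python
import math
import random


def lr_and_sup_error(m, r, n_grid=30, seed=0):
    """Midpoint-rule estimate of ||k - k_hat||_{L^r(S)} and the grid sup error, S = [0, 1]."""
    rng = random.Random(seed)
    omegas = [rng.gauss(0.0, 1.0) for _ in range(m)]   # omega_j ~ N(0, 1)
    pts = [(i + 0.5) / n_grid for i in range(n_grid)]  # midpoints of the grid cells
    acc, sup_err = 0.0, 0.0
    for x in pts:
        for y in pts:
            z = x - y
            approx = sum(math.cos(w * z) for w in omegas) / m
            diff = abs(math.exp(-0.5 * z * z) - approx)
            acc += diff ** r
            sup_err = max(sup_err, diff)
    return (acc / n_grid ** 2) ** (1.0 / r), sup_err


def rate_exponent(r):
    """Exponent a in the m^(-a) rate implied by (2/m) * m^max(1/2, 1/r)."""
    return 1.0 - max(0.5, 1.0 / r)


lr_err, sup_err = lr_and_sup_error(200, 3.0)
print(lr_err, sup_err, rate_exponent(3.0), rate_exponent(1.5))
```

Since $\mathrm{vol}(\mathcal{S}) = 1$ here, the quadrature value of the $L^r$ error cannot exceed the grid sup error, mirroring the step $\|\cos(\langle\omega_i,\cdot-\cdot\rangle)\|_{L^r(\mathcal{S})} \le \mathrm{vol}^{2/r}(\mathcal{S})$.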


B.3 Proof of Theorem 5

Below we give the detailed proof of Theorem 5. At a high level the proof goes as follows: (i) By the compactness of $\mathcal{S}_\Delta$ (implied
by that of $\mathcal{S}$) one can take an $r$-net covering $\mathcal{S}_\Delta$ (for any $r > 0$). (ii) Small approximation error can be guaranteed at the
centers of the $r$-net by Bernstein's inequality combined with a union bound. (iii) Propagation of the error from the centers to
arbitrary points is achieved by Lipschitzness. (iv) The Lipschitz constant is, however, a random quantity and we show that with high
probability it is 'not too large'. (v) Union bounding the two events (small errors at the centers and small Lipschitz constant)
leads to a uniform bound for arbitrary $r$, which holds with high probability. (vi) Optimizing over $r$ gives the stated result.

Formally, the proof is as follows. Let us define

$$B_{p,q,\mathcal{S}} := E_{\omega\sim\Lambda}\left[ \sup_{z\in\mathrm{conv}(\mathcal{S}_\Delta)} \|\nabla_z f(z;\omega)\|_2 \right],$$

where $f(z;\omega) = \partial^{p,q}k(z) - \omega^p(-\omega)^q h_{|p+q|}(\omega^T z)$. Let us notice that since $\mathrm{conv}(\mathcal{S}_\Delta)$ is compact (by the compactness of
$\mathcal{S}_\Delta$, implied by that of $\mathcal{S}$) and $z\mapsto\|\nabla_z f(z;\omega)\|_2$ is continuous, the supremum inside the expectation in $B_{p,q,\mathcal{S}}$ is finite for any
$\omega$.

• Covering of $\mathcal{S}_\Delta$: By the compactness of $\mathcal{S}_\Delta$ there exists an $r$-net with at most

$$N = \left( \frac{4|\mathcal{S}|}{r} + 1 \right)^d \qquad \text{(B.17)}$$

balls covering $\mathcal{S}_\Delta$ [10, Lemma 2.5, page 20], where we used that $|\mathcal{S}_\Delta| \le 2|\mathcal{S}|$. Let us denote the centers of this $r$-net by
$c_1,\ldots,c_N$.

• Bounding $f(b;\omega_{1:m}) - f(a;\omega_{1:m})$, where $a,b\in\mathcal{S}_\Delta$; $\omega_{1:m} = (\omega_1,\ldots,\omega_m)$ is fixed: Let

$$f(z;\omega_{1:m}) = \frac{1}{m}\sum_{j=1}^m f(z;\omega_j) = \frac{1}{m}\sum_{j=1}^m \left[ \partial^{p,q}k(z) - \omega_j^p(-\omega_j)^q h_{|p+q|}(\omega_j^T z) \right].$$

$z\mapsto f(z;\omega_{1:m})$ is continuously differentiable since $\psi$ is so. Thus by the mean value theorem $\exists\,t\in(0,1)$ such that

$$f(b;\omega_{1:m}) - f(a;\omega_{1:m}) = \langle \nabla_z f(ta+(1-t)b;\omega_{1:m}),\, b-a \rangle.$$

Hence by the Cauchy-Bunyakovsky-Schwarz inequality, we get

$$|f(b;\omega_{1:m}) - f(a;\omega_{1:m})| \le \|\nabla_z f(ta+(1-t)b;\omega_{1:m})\|_2 \|b-a\|_2 \le \sup_{z\in\mathrm{conv}(\mathcal{S}_\Delta)} \|\nabla_z f(z;\omega_{1:m})\|_2 \|b-a\|_2 =: L(\omega_{1:m})\|b-a\|_2, \qquad \text{(B.18)}$$

where we used the compactness of $\mathrm{conv}(\mathcal{S}_\Delta)$ (implied by that of $\mathcal{S}_\Delta$) and the continuity of the $z\mapsto\|\nabla_z f(z;\omega_{1:m})\|_2$
mapping to guarantee that $L(\omega_{1:m})$ exists and is finite for any $\omega_{1:m}$.


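• Bound on $E_{\omega_1,\ldots,\omega_m}[L(\omega_{1:m})]$: Using the definition of $f(z;\omega_{1:m})$, the linearity of differentiation, and the triangle
inequality, we get

$$\|\nabla_z f(z;\omega_{1:m})\|_2 = \left\| \frac{1}{m}\sum_{j=1}^m \nabla_z f(z;\omega_j) \right\|_2 \le \frac{1}{m}\sum_{j=1}^m \|\nabla_z f(z;\omega_j)\|_2.$$

Therefore,

$$\sup_{z\in\mathrm{conv}(\mathcal{S}_\Delta)} \|\nabla_z f(z;\omega_{1:m})\|_2 \le \frac{1}{m}\sum_{j=1}^m \sup_{z\in\mathrm{conv}(\mathcal{S}_\Delta)} \|\nabla_z f(z;\omega_j)\|_2$$

and

$$E_{\omega_{1:m}}[L(\omega_{1:m})] = E_{\omega_{1:m}}\left[ \sup_{z\in\mathrm{conv}(\mathcal{S}_\Delta)} \|\nabla_z f(z;\omega_{1:m})\|_2 \right] \le \frac{1}{m}\sum_{j=1}^m E_{\omega_j}\left[ \sup_{z\in\mathrm{conv}(\mathcal{S}_\Delta)} \|\nabla_z f(z;\omega_j)\|_2 \right] = \frac{1}{m}\sum_{j=1}^m B_{p,q,\mathcal{S}} = B_{p,q,\mathcal{S}}. \qquad \text{(B.19)}$$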

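• Bound on $B_{p,q,\mathcal{S}}$: Note that

$$\begin{aligned}
\sup_{z\in\mathrm{conv}(\mathcal{S}_\Delta)} \|\nabla_z f(z;\omega)\|_2 &= \sup_{z\in\mathrm{conv}(\mathcal{S}_\Delta)} \left\| \nabla_z\left[ \partial^{p,q}k(z) - \omega^p(-\omega)^q h_{|p+q|}(\omega^T z) \right] \right\|_2 \\
&\le \sup_{z\in\mathrm{conv}(\mathcal{S}_\Delta)} \left( \left\| \nabla_z\left[\partial^{p,q}k(z)\right] \right\|_2 + \left\| \nabla_z\left[ \omega^p(-\omega)^q h_{|p+q|}(\omega^T z) \right] \right\|_2 \right) \\
&\le \sup_{z\in\mathrm{conv}(\mathcal{S}_\Delta)} \left\| \nabla_z\left[\partial^{p,q}k(z)\right] \right\|_2 + \sup_{z\in\mathrm{conv}(\mathcal{S}_\Delta)} \left\| \nabla_z\left[ \omega^p(-\omega)^q h_{|p+q|}(\omega^T z) \right] \right\|_2 \\
&= D_{p,q,\mathcal{S}} + \sup_{z\in\mathrm{conv}(\mathcal{S}_\Delta)} \left\| \nabla_z\left[ \omega^p(-\omega)^q h_{|p+q|}(\omega^T z) \right] \right\|_2. \qquad \text{(B.20)}
\end{aligned}$$

By the homogeneity of norms ($\|av\| = |a|\,\|v\|$), the chain rule, and $|h_a(v)| \le 1$ ($\forall a, \forall v$),

$$\left\| \nabla_z\left[ \omega^p(-\omega)^q h_{|p+q|}(\omega^T z) \right] \right\|_2 = |\omega^{p+q}|\,\left\| h_{|p+q|+1}(\omega^T z)\,\omega \right\|_2 \le |\omega^{p+q}|\,\|\omega\|_2. \qquad \text{(B.21)}$$

Combining Eq. (B.20) and (B.21) results in the bound

$$B_{p,q,\mathcal{S}} = E_{\omega\sim\Lambda}\left[ \sup_{z\in\mathrm{conv}(\mathcal{S}_\Delta)} \|\nabla_z f(z;\omega)\|_2 \right] \le D_{p,q,\mathcal{S}} + E_{\omega\sim\Lambda}\left[ |\omega^{p+q}|\,\|\omega\|_2 \right] = D_{p,q,\mathcal{S}} + E_{p,q}. \qquad \text{(B.22)}$$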

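• Error propagation from the net centers: We will use the following note to propagate the error from the net centers ($c_j$,
$j = 1,\ldots,N$) to an arbitrary point $z\in\mathcal{S}_\Delta$. Note: If $|f(c_j;\omega_{1:m})| < \frac{\epsilon}{2}$ ($\forall j$) and $L(\omega_{1:m}) < \frac{\epsilon}{2r}$, then

$$|f(z;\omega_{1:m})| < \epsilon \quad (\forall z\in\mathcal{S}_\Delta). \qquad \text{(B.23)}$$

Indeed, choosing a center $c_j$ with $\|z-c_j\|_2 \le r$,

$$\left| |f(z;\omega_{1:m})| - |f(c_j;\omega_{1:m})| \right| \le |f(z;\omega_{1:m}) - f(c_j;\omega_{1:m})| \le L(\omega_{1:m})\|z-c_j\|_2 < \frac{\epsilon}{2r}\,r = \frac{\epsilon}{2},$$

where we used (B.18) and our assumptions in the note, thereby yielding (B.23).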




























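• Guaranteeing the conditions of (B.23) with high probability:

- Notice that $E_{\omega\sim\Lambda}[f(z;\omega)] = 0$ ($\forall z$). Also, since the moment condition required by Bernstein's inequality holds, applying Bernstein's inequality for the individual $c_j$ points
(Lemma C.3; $\xi_n := f(c_j;\omega_n)$, $n = 1,\ldots,m$; $S := \sqrt{m}\sigma$) gives that for any $\eta > 0$

$$\Lambda^m\left( |f(c_j;\omega_{1:m})| \ge \frac{\eta\sigma}{\sqrt{m}} \right) \le e^{-\frac{1}{2}\frac{\eta^2}{1+\frac{\eta L}{\sigma\sqrt{m}}}}. \qquad \text{(B.24)}$$

Setting $\frac{\epsilon}{2} = \frac{\eta\sigma}{\sqrt{m}}$, (B.24) is written as

$$\Lambda^m\left( |f(c_j;\omega_{1:m})| < \frac{\epsilon}{2} \right) \ge 1 - e^{-\frac{m\epsilon^2}{8\sigma^2\left(1+\frac{\epsilon L}{2\sigma^2}\right)}}.$$

By union bounding ($j = 1,\ldots,N$), we get

$$\Lambda^m\left( \cap_{j=1}^N \left\{ |f(c_j;\omega_{1:m})| < \frac{\epsilon}{2} \right\} \right) \ge 1 - Ne^{-\frac{m\epsilon^2}{8\sigma^2\left(1+\frac{\epsilon L}{2\sigma^2}\right)}}. \qquad \text{(B.25)}$$

- Condition $L(\omega_{1:m}) < \frac{\epsilon}{2r}$: Applying Markov's inequality to $L(\omega_{1:m})$ (note that $L(\omega_{1:m})$ is non-negative), for any
$t > 0$, we obtain

$$\Lambda^m\left( L(\omega_{1:m}) \ge t \right) \le \frac{E_{\omega_1,\ldots,\omega_m}[L(\omega_{1:m})]}{t} \le \frac{D_{p,q,\mathcal{S}} + E_{p,q}}{t}$$

by invoking (B.19) and (B.22). Choosing $t = \frac{\epsilon}{2r}$, we have

$$\Lambda^m\left( L(\omega_{1:m}) < \frac{\epsilon}{2r} \right) \ge 1 - \frac{2r}{\epsilon}\left( D_{p,q,\mathcal{S}} + E_{p,q} \right). \qquad \text{(B.26)}$$

• Final bound for any $r > 0$: By (B.25) and (B.26), and substituting the explicit form of $N$ in (B.17), we get

$$\begin{aligned}
\Lambda^m\left( \sup_{z\in\mathcal{S}_\Delta} |f(z;\omega_{1:m})| < \epsilon \right) &\ge \Lambda^m\left( \left\{ L(\omega_{1:m}) < \frac{\epsilon}{2r} \right\} \cap \cap_{j=1}^N \left\{ |f(c_j;\omega_{1:m})| < \frac{\epsilon}{2} \right\} \right) \\
&\ge 1 - \left( \frac{4|\mathcal{S}|}{r} + 1 \right)^d e^{-\frac{m\epsilon^2}{8\sigma^2\left(1+\frac{\epsilon L}{2\sigma^2}\right)}} - \frac{2r}{\epsilon}\left( D_{p,q,\mathcal{S}} + E_{p,q} \right) \\
&\ge 1 - c_* - \kappa_1 r^{-d} - \kappa_2 r, \qquad \text{(B.27)}
\end{aligned}$$

where we invoked

$$\left( \frac{4|\mathcal{S}|}{r} + 1 \right)^d = 2^d\left( \frac{1}{2}\cdot\frac{4|\mathcal{S}|}{r} + \frac{1}{2} \right)^d \overset{(\dagger)}{\le} 2^d\left[ \frac{1}{2}\left( \frac{4|\mathcal{S}|}{r} \right)^d + \frac{1}{2} \right] = 2^{d-1}\left[ \left( \frac{4|\mathcal{S}|}{r} \right)^d + 1 \right],$$

Jensen's inequality in $(\dagger)$, $c_* := 2^{d-1}e^{-\frac{m\epsilon^2}{8\sigma^2\left(1+\frac{\epsilon L}{2\sigma^2}\right)}}$, $\kappa_1 := 4^d|\mathcal{S}|^d c_*$ and $\kappa_2 := \frac{2}{\epsilon}\left( D_{p,q,\mathcal{S}} + E_{p,q} \right)$.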
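
The Jensen step $(\dagger)$ is the elementary inequality $(x+1)^d \le 2^{d-1}(x^d+1)$ for $x \ge 0$ with $x = 4|\mathcal{S}|/r$. The following sketch checks it in exact rational arithmetic for a few illustrative values; the chosen diameters, radii and dimensions are arbitrary, and the function names are hypothetical.

```python
from fractions import Fraction


def covering_count(diameter, r, d):
    """N = (4|S|/r + 1)^d as in (B.17)."""
    return (4 * diameter / r + 1) ** d


def split_bound(diameter, r, d):
    """2^(d-1) * [(4|S|/r)^d + 1], the right-hand side after Jensen's inequality."""
    return 2 ** (d - 1) * ((4 * diameter / r) ** d + 1)


# Exact comparison over a small grid of diameters |S|, radii r and dimensions d.
checks = [
    covering_count(Fraction(s), Fraction(r), d) <= split_bound(Fraction(s), Fraction(r), d)
    for s in (1, 2, 5)
    for r in (Fraction(1, 10), Fraction(1, 2), 1, 3)
    for d in range(1, 7)
]
# Equality case: 4|S|/r = 1, i.e. |S| = 1 and r = 4.
equal_case = covering_count(Fraction(1), Fraction(4), 3) == split_bound(Fraction(1), Fraction(4), 3)
print(all(checks), equal_case)
```

The split is what allows the bound to separate into the $r$-independent term $c_*$ and the term $\kappa_1 r^{-d}$.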

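• Matching the two terms to choose $r$: Maximizing the lower bound in (B.27) w.r.t. $r$ is equivalent to minimizing

$$f(r) = \kappa_1 r^{-d} + \kappa_2 r \;\Rightarrow\; f'(r) = \kappa_1(-d)r^{-d-1} + \kappa_2 = 0 \;\Rightarrow\; \frac{d\kappa_1}{\kappa_2} = r^{d+1};$$

note that $r = \left( \frac{d\kappa_1}{\kappa_2} \right)^{\frac{1}{d+1}}$ minimizes $f$, and hence maximizes the bound. Using this in (B.27), we have

$$\begin{aligned}
\Lambda^m\left( \sup_{z\in\mathcal{S}_\Delta} |f(z;\omega_{1:m})| \ge \epsilon \right) &\le c_* + \kappa_1\left( \frac{d\kappa_1}{\kappa_2} \right)^{-\frac{d}{d+1}} + \kappa_2\left( \frac{d\kappa_1}{\kappa_2} \right)^{\frac{1}{d+1}} = c_* + F_d\,\kappa_1^{\frac{1}{d+1}}\kappa_2^{\frac{d}{d+1}} \\
&= 2^{d-1}e^{-\frac{m\epsilon^2}{8\sigma^2\left(1+\frac{\epsilon L}{2\sigma^2}\right)}} + F_d\left[ 2^{3d-1}|\mathcal{S}|^d e^{-\frac{m\epsilon^2}{8\sigma^2\left(1+\frac{\epsilon L}{2\sigma^2}\right)}} \right]^{\frac{1}{d+1}} \left[ \frac{2}{\epsilon}\left( D_{p,q,\mathcal{S}} + E_{p,q} \right) \right]^{\frac{d}{d+1}} \\
&= 2^{d-1}e^{-\frac{m\epsilon^2}{8\sigma^2\left(1+\frac{\epsilon L}{2\sigma^2}\right)}} + F_d\,2^{\frac{4d-1}{d+1}}\left[ \frac{|\mathcal{S}|\left( D_{p,q,\mathcal{S}} + E_{p,q} \right)}{\epsilon} \right]^{\frac{d}{d+1}} e^{-\frac{m\epsilon^2}{8(d+1)\sigma^2\left(1+\frac{\epsilon L}{2\sigma^2}\right)}},
\end{aligned}$$

where $F_d := d^{-\frac{d}{d+1}} + d^{\frac{1}{d+1}}$.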
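
The optimization over $r$ can be sanity-checked numerically: at $r^* = (d\kappa_1/\kappa_2)^{1/(d+1)}$ the value of $\kappa_1 r^{-d} + \kappa_2 r$ should equal $F_d\,\kappa_1^{1/(d+1)}\kappa_2^{d/(d+1)}$ and should not exceed the value at nearby radii. The numerical values of $\kappa_1$, $\kappa_2$ and $d$ below are arbitrary illustrative choices, and the function names are hypothetical.

```python
import math


def tail_terms(r, kappa1, kappa2, d):
    """The r-dependent part kappa1 * r^(-d) + kappa2 * r of the bound (B.27)."""
    return kappa1 * r ** (-d) + kappa2 * r


def optimal_radius(kappa1, kappa2, d):
    """Stationary point r = (d * kappa1 / kappa2)^(1 / (d + 1))."""
    return (d * kappa1 / kappa2) ** (1.0 / (d + 1))


def closed_form_minimum(kappa1, kappa2, d):
    """F_d * kappa1^(1/(d+1)) * kappa2^(d/(d+1)) with F_d = d^(-d/(d+1)) + d^(1/(d+1))."""
    f_d = d ** (-d / (d + 1)) + d ** (1.0 / (d + 1))
    return f_d * kappa1 ** (1.0 / (d + 1)) * kappa2 ** (d / (d + 1))


kappa1, kappa2, d = 0.3, 2.0, 3
r_star = optimal_radius(kappa1, kappa2, d)
value_at_optimum = tail_terms(r_star, kappa1, kappa2, d)
value_closed_form = closed_form_minimum(kappa1, kappa2, d)
print(r_star, value_at_optimum, value_closed_form)
```

Agreement of the two values confirms the algebra that produces the constant $F_d$ in the final bound.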
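B.4 Proof that bounded supp($\Lambda$) implies the Bernstein moment condition

We prove that the boundedness of supp($\Lambda$) implies that of $f$ [see (B.28)], and hence specifically the moment condition required by Bernstein's inequality.

Proof: Indeed, let

$$f(z;\omega) = \partial^{p,q}k(z) - \omega^p(-\omega)^q h_{|p+q|}(\omega^T z) = \int_{\mathbb{R}^d} s^p(-s)^q h_{|p+q|}(s^T z)\,d\Lambda(s) - \omega^p(-\omega)^q h_{|p+q|}(\omega^T z).$$

Applying the triangle inequality and $|h_a(v)| \le 1$ ($\forall a, \forall v$) we have

$$\begin{aligned}
|f(z;\omega)| &\le \left| \int_{\mathbb{R}^d} s^p(-s)^q h_{|p+q|}(s^T z)\,d\Lambda(s) \right| + \left| \omega^p(-\omega)^q h_{|p+q|}(\omega^T z) \right| \le \int_{\mathbb{R}^d} \left| s^p(-s)^q h_{|p+q|}(s^T z) \right| d\Lambda(s) + |\omega^{p+q}| \\
&\le \int_{\mathbb{R}^d} |s^{p+q}|\,d\Lambda(s) + |\omega^{p+q}| = \int_{\mathrm{supp}(\Lambda)} |s^{p+q}|\,d\Lambda(s) + |\omega^{p+q}| \le 2\sup_{s\in\mathrm{supp}(\Lambda)} |s^{p+q}|. \qquad \text{(B.28)}
\end{aligned}$$

$K := \sup_{s\in\mathrm{supp}(\Lambda)} |s^{p+q}|$ is finite since supp($\Lambda$) is bounded, thus $|f(z;\omega)|$ is bounded.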

C Supplementary results 

In this section, we present some technical results that are used in the proofs. 

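Lemma C.1 (McDiarmid Inequality [7]). Let $(X_i)_{i=1}^m$ be $\mathcal{X}$-valued independent random variables. Suppose $f : \mathcal{X}^m \to \mathbb{R}$
satisfies the bounded difference property,

$$\sup_{u_1,\ldots,u_m,u_r'\in\mathcal{X}} |f(u_1,\ldots,u_m) - f(u_1,\ldots,u_{r-1},u_r',u_{r+1},\ldots,u_m)| \le c_r \quad (\forall r = 1,\ldots,m). \qquad \text{(C.1)}$$

Then for any $\epsilon > 0$,

$$P\left( f(X_1,\ldots,X_m) - E[f(X_1,\ldots,X_m)] \ge \epsilon \right) \le e^{-\frac{2\epsilon^2}{\sum_{r=1}^m c_r^2}}.$$

Note: specifically, if $c = c_r$ ($\forall r$) then applying the $\tau = \frac{2\epsilon^2}{mc^2} \Leftrightarrow \epsilon = c\sqrt{\frac{\tau m}{2}}$ reparameterization one gets

$$P\left( f(X_1,\ldots,X_m) < E[f(X_1,\ldots,X_m)] + c\sqrt{\frac{\tau m}{2}} \right) \ge 1 - e^{-\tau}.$$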
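
The reparameterized form of McDiarmid's inequality can be illustrated by simulation. The sketch below assumes $f$ is the sample mean of $m$ independent Uniform$[0,1]$ variables, for which the bounded difference constant is $c = 1/m$ and $E[f] = 1/2$; this choice of $f$, the trial count, and the function names are assumptions made for the example only.

```python
import math
import random


def mcdiarmid_radius(c, m, tau):
    """Deviation c * sqrt(tau * m / 2) from the note after Lemma C.1."""
    return c * math.sqrt(tau * m / 2.0)


def exceedance_frequency(m, tau, n_trials=2000, seed=0):
    """Fraction of trials where the mean of m Uniform[0,1] draws exceeds E[f] + radius."""
    rng = random.Random(seed)
    threshold = 0.5 + mcdiarmid_radius(1.0 / m, m, tau)  # c = 1/m for the sample mean
    hits = 0
    for _ in range(n_trials):
        mean = sum(rng.random() for _ in range(m)) / m
        hits += mean > threshold
    return hits / n_trials


m, tau = 50, 2.0
freq = exceedance_frequency(m, tau)
eps = mcdiarmid_radius(1.0 / m, m, tau)
# exp(-2 eps^2 / sum_r c_r^2) with c_r = 1/m for every r.
tail_bound = math.exp(-2.0 * eps ** 2 / (m * (1.0 / m) ** 2))
print(freq, tail_bound)
```

The empirical exceedance frequency is typically far below $e^{-\tau}$, since McDiarmid's inequality uses only the bounded difference constants and not the variance.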
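Lemma C.2. For $a > 1$, $\int_0^1 \sqrt{\log\frac{a}{\epsilon}}\,d\epsilon \le \sqrt{\log a} + \frac{1}{2\sqrt{\log a}}$.

Proof. By change of variables, we have $\int_0^1 \sqrt{\log\frac{a}{\epsilon}}\,d\epsilon = a\int_{\log a}^\infty \sqrt{t}\,e^{-t}\,dt$. Applying partial integration, we have

$$\int_{\log a}^\infty \sqrt{t}\,e^{-t}\,dt = \left[ -\sqrt{t}\,e^{-t} \right]_{\log a}^\infty + \int_{\log a}^\infty \frac{1}{2\sqrt{t}}e^{-t}\,dt \le \frac{\sqrt{\log a}}{a} + \frac{1}{2\sqrt{\log a}}\int_{\log a}^\infty e^{-t}\,dt = \frac{\sqrt{\log a}}{a} + \frac{1}{2a\sqrt{\log a}},$$

which yields the result.

□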
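
Lemma C.2 can be checked numerically for a few values of $a$. The sketch below uses a midpoint rule with a hypothetical number of nodes; for the values of $a$ tested the integrand is convex on $(0,1]$, so the midpoint rule does not overestimate the integral and the comparison is only an illustration, not a proof.

```python
import math


def entropy_integral(a, n=20000):
    """Midpoint-rule value of the integral of sqrt(log(a / eps)) over eps in (0, 1)."""
    h = 1.0 / n
    return h * sum(math.sqrt(math.log(a / ((i + 0.5) * h))) for i in range(n))


def lemma_c2_bound(a):
    """Right-hand side sqrt(log a) + 1 / (2 sqrt(log a)) of Lemma C.2."""
    return math.sqrt(math.log(a)) + 1.0 / (2.0 * math.sqrt(math.log(a)))


# Map each test value a to (numerical integral, upper bound from the lemma).
results = {a: (entropy_integral(a), lemma_c2_bound(a)) for a in (2.0, 5.0, 50.0)}
for a, (integral, bound) in results.items():
    print(a, integral, bound)
```

The gap between the integral and the bound narrows as $a$ grows, which matches the role of the correction term $\frac{1}{2\sqrt{\log a}}$.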


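Lemma C.3 (Bernstein inequality [12]). Let $\xi\in\mathbb{R}$ be a random variable, $E_{\xi\sim P}[\xi] = 0$, and assume that $\exists L > 0, S > 0$
satisfying

$$\sum_{j=1}^m E\left[ |\xi_j|^M \right] \le \frac{M!\,S^2L^{M-2}}{2} \quad (\forall M \ge 2),$$

where $(\xi_j)_{j=1}^m \overset{\text{i.i.d.}}{\sim} P$. Then for any $0 < m\in\mathbb{N}$, $\eta > 0$,

$$P^m\left( \left| \sum_{j=1}^m \xi_j \right| \ge \eta S \right) \le e^{-\frac{1}{2}\frac{\eta^2}{1+\frac{\eta L}{S}}}.$$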

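Lemma C.4 ($L^r$ norm as countable supremum). Assume that $1 < \tilde{r} < \infty$. If $(\mathcal{X},\mathcal{A},\mu)$, $\mu(\mathcal{X}) < \infty$, $\frac{1}{r}+\frac{1}{\tilde{r}} = 1$, then
$[L^{\tilde{r}}(\mathcal{X},\mathcal{A},\mu)]^* = \{F_f : f\in L^r(\mathcal{X},\mathcal{A},\mu)\}$, where $F_f(u) = \int_{\mathcal{X}} uf\,d\mu$, and $\|f\|_{L^r} = \|F_f\|$ ($= \sup_{\|g\|_{L^{\tilde{r}}}=1} |F_f(g)|$); see [8,
Theorem 4.1]. Specifically, if $\mathcal{X} = \mathcal{S}\subseteq\mathbb{R}^d$ is compact and it is endowed with the Borel $\sigma$-algebra, then by the separability of $\mathcal{S}$,
$L^{\tilde{r}}(\mathcal{S})$ is also separable [4, Prop. 3.4.5] since the Borel $\sigma$-algebra is countably generated [1, page 17 (vol. 2)], thus there exists
a countable $\mathcal{G}\subseteq L^{\tilde{r}}(\mathcal{S})$ [3, Lemma 6.7] such that $\|g\|_{L^{\tilde{r}}(\mathcal{S})} = 1$ ($\forall g\in\mathcal{G}$) and $\|F_f\| = \sup_{g\in\mathcal{G}} |F_f(g)|$.

Note: the $\sigma$-algebra of Lebesgue measurable sets is typically not countably generated [1, page 106 (vol. 1)].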